Transport Layer and Data Center TCP. Hakim Weatherspoon, Assistant Professor, Dept of Computer Science. CS 5413: High Performance Systems and Networking. September 5, 2014. Slides used and adapted judiciously from Computer Networking: A Top-Down Approach.

Transport Layer and Data Center TCP


Page 1: Transport Layer and Data Center TCP

Transport Layer and Data Center TCP

Hakim Weatherspoon, Assistant Professor, Dept of Computer Science

CS 5413: High Performance Systems and Networking, September 5, 2014

Slides used and adapted judiciously from Computer Networking: A Top-Down Approach

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem

• provide logical communication between app processes running on different hosts
• transport protocols run in end systems
  – send side: breaks app messages into segments, passes to network layer
  – rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps: Internet: TCP and UDP

[Figure: two end hosts' protocol stacks (application, transport, network, data link, physical), with logical end-end transport between the transport layers]

Transport Layer Services/Protocols

Transport Layer Services/Protocols

• network layer: logical communication between hosts
• transport layer: logical communication between processes; relies on, and enhances, network layer services

household analogy: 12 kids in Ann's house sending letters to 12 kids in Bill's house
• hosts = houses
• processes = kids
• app messages = letters in envelopes
• transport protocol = Ann and Bill, who demux to in-house siblings
• network-layer protocol = postal service

Transport vs Network Layer

• reliable, in-order delivery (TCP)
  – congestion control
  – flow control
  – connection setup
• unreliable, unordered delivery: UDP
  – no-frills extension of "best-effort" IP
• services not available:
  – delay guarantees
  – bandwidth guarantees

[Figure: two end hosts with full protocol stacks (application, transport, network, data link, physical), connected across several routers that implement only the network, data link, and physical layers; logical end-end transport runs between the two transport layers]

Transport Layer Services/Protocols

TCP service:
• reliable transport between sending and receiving process
• flow control: sender won't overwhelm receiver
• congestion control: throttle sender when network overloaded
• does not provide: timing, minimum throughput guarantee, security
• connection-oriented: setup required between client and server processes

UDP service:
• unreliable data transfer between sending and receiving process
• does not provide: reliability, flow control, congestion control, timing, throughput guarantee, security, or connection setup

Q: why bother? Why is there a UDP?

Transport Layer Services/Protocols

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem

• multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)
• demultiplexing at receiver: use header info to deliver received segments to correct socket

[Figure: three hosts' protocol stacks (application, transport, network, link, physical); application processes P1-P4 attach to sockets, with transport-layer mux/demux between them]

Transport Layer: Sockets, Multiplexing/Demultiplexing
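The demultiplexing described above can be demonstrated with two UDP sockets on loopback; a minimal Python sketch, not from the slides (addresses, ports, and payloads are illustrative):

```python
import socket

# Two receiving sockets, each bound to its own port: the OS demultiplexes
# incoming datagrams to the right socket by destination port.
rx1 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx1.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
rx2.bind(("127.0.0.1", 0))
p1 = rx1.getsockname()[1]
p2 = rx2.getsockname()[1]

# One sending socket multiplexes messages to both destinations.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"for p1", ("127.0.0.1", p1))
tx.sendto(b"for p2", ("127.0.0.1", p2))

msg1, _ = rx1.recvfrom(1024)   # only the datagram addressed to p1
msg2, _ = rx2.recvfrom(1024)   # only the datagram addressed to p2
print(msg1, msg2)
for s in (rx1, rx2, tx):
    s.close()
```

Each socket sees only the datagram whose destination port matches its binding: that lookup is the demultiplexing step.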

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem

UDP segment format (32 bits per row):
| source port # | dest port # |
| length | checksum |
| application data (payload) |
length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP Connectionless Transport: UDP Segment Header
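The fixed four-field header above lends itself to a short packing sketch; the port numbers and payload below are made up for illustration:

```python
import struct

def udp_header(src_port, dst_port, payload, checksum=0):
    """Pack the 8-byte UDP header: four 16-bit big-endian fields.
    The length field covers header + payload, in bytes."""
    length = 8 + len(payload)
    return struct.pack("!4H", src_port, dst_port, length, checksum)

hdr = udp_header(5413, 53, b"query")
print(struct.unpack("!4H", hdr))   # (5413, 53, 13, 0)
```

A 5-byte payload yields length 13 (8-byte header + 5), matching the "including header" rule on the slide.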

UDP Connectionless Transport

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …

UDP Checksum
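A minimal Python sketch of the one's-complement checksum described above (the sample bytes are illustrative):

```python
def checksum16(data: bytes) -> int:
    """One's complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:                 # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # wrap carry around
    return ~total & 0xFFFF

seg = b"\x01\x02\x03\x04"
cks = checksum16(seg)                  # 0xFBF9
# Receiver check: checksumming segment + checksum together yields 0
# (equivalently, the one's-complement sum is all ones, 0xFFFF).
check = checksum16(seg + cks.to_bytes(2, "big"))
print(hex(cks), hex(check))            # 0xfbf9 0x0
```

The carry wrap-around inside the loop is what makes this a one's-complement sum rather than plain 16-bit addition.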

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem

• important in application, transport, link layers: top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport


send side / receive side interface:
• rdt_send(): called from above (e.g., by app); passes data to deliver to receiver upper layer
• udt_send(): called by rdt to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Principles of Reliable Transport

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport

TCP segment structure (32 bits per row):
| source port # | dest port # |
| sequence number |
| acknowledgement number |
| head len | not used | U A P R S F | receive window |
| checksum | urg data pointer |
| options (variable length) |
| application data (variable length) |

• sequence/acknowledgement numbers: counting by bytes of data (not segments!)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• flags: URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)

TCP Reliable Transport: TCP Segment Structure

sequence numbers:
– byte stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how does receiver handle out-of-order segments?
– A: TCP spec doesn't say; up to implementor

[Figure: outgoing and incoming segment headers (source/dest port, sequence number, acknowledgement number, rwnd, checksum, urg pointer); sender sequence number space of window size N, split into: sent and ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

TCP Reliable Transport: TCP Sequence Numbers and ACKs

[Figure: simple telnet scenario between Host A and Host B. User types 'C': A sends Seq=42, ACK=79, data='C'. Host B ACKs receipt of 'C' and echoes back 'C': Seq=79, ACK=43, data='C'. Host A ACKs receipt of the echoed 'C': Seq=43, ACK=80]

TCP Reliable Transport: TCP Sequence Numbers and ACKs
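The numbers in the telnet exchange all follow one rule: a cumulative ACK carries the sequence number of the next byte expected. A tiny sketch replaying the slide's values:

```python
def next_ack(seq: int, payload: bytes) -> int:
    """Cumulative ACK: sequence number of the next byte expected,
    i.e. the received segment's seq plus its payload length."""
    return seq + len(payload)

# Replaying the telnet scenario:
ack_for_C    = next_ack(42, b"C")   # host B expects byte 43 next
ack_for_echo = next_ack(79, b"C")   # host A expects byte 80 next
print(ack_for_C, ack_for_echo)      # 43 80
```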


before exchanging data, sender and receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish the connection)
• agree on connection parameters

[Figure: client and server applications each reach connection state ESTAB, with connection variables: seq # (client-to-server, server-to-client); rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

Connection Management: TCP 3-way handshake

TCP Reliable Transport

client state: LISTEN → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x)
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1)
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data
4. server: received ACK(y) indicates client is live

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

[Figure: 3-way handshake as state machines. Client: closed; Socket clientSocket = new Socket("hostname","port number") sends SYN(seq=x), enter SYNsent; on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1), enter ESTAB. Server: closed → listen via Socket connectionSocket = welcomeSocket.accept(); on SYN(x), send SYNACK(seq=y, ACKnum=x+1) and create new socket for communication back to client, enter SYNrcvd; on ACK(ACKnum=y+1), enter ESTAB]

TCP Reliable Transport: Connection Management (TCP 3-way handshake)
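The handshake itself is performed by the kernel inside connect()/accept(): accept() returns only after SYN / SYNACK / ACK complete. A small Python sketch on loopback, not from the slides (the OS picks the port; the echoed byte mirrors the telnet example):

```python
import socket
import threading

# A listening socket; accept() completes once the 3-way handshake is done.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))            # port 0: OS picks a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve():
    conn, _ = srv.accept()            # returns after SYN / SYNACK / ACK
    conn.sendall(conn.recv(1024))     # echo one message back
    conn.close()

t = threading.Thread(target=serve)
t.start()

# connect() performs the client side of the handshake (SYN, then ACK).
cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"C")
echoed = cli.recv(1024)
print(echoed)                          # b'C'
cli.close()
t.join()
srv.close()
```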

client, server each close their side of connection:
• send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management (Closing connection)

[Figure: closing the connection. Both sides start in ESTAB. Client calls clientSocket.close(), sends FINbit=1, seq=x, enters FIN_WAIT_1 (can no longer send, but can receive data). Server ACKs (ACKbit=1, ACKnum=x+1), enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 and waits for server close. Server sends FINbit=1, seq=y, enters LAST_ACK (can no longer send data). Client ACKs (ACKbit=1, ACKnum=y+1), enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server enters CLOSED on receiving the ACK]

TCP Reliable Transport: Connection Management (Closing connection)


data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport

[Figure: TCP retransmission scenarios. Lost ACK: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost; after a timeout, A retransmits Seq=92 and B re-ACKs. Premature timeout: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); A's timer expires before ACK=100 arrives, so A retransmits Seq=92; the cumulative ACK=120 then advances SendBase from 92 to 100 to 120]

TCP Reliable Transport: TCP Retransmission Scenarios

[Figure: cumulative ACK scenario. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but the cumulative ACK=120 arrives before the timeout, so A proceeds with Seq=120 (15 bytes) without retransmitting]

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, 2581]
event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

TCP Reliable Transport

• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if a segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #: likely that the unacked segment was lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the second segment is lost; Host B responds with ACK=100 four times; on the triple duplicate ACK, A resends Seq=100 (20 bytes) before its timeout expires]

TCP Reliable Transport: TCP Fast Retransmit
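The trigger condition can be sketched as a scan over the ACK stream (ACK numbers below are the slide's example; this is an illustration, not a full sender):

```python
def acks_triggering_fast_retransmit(acks):
    """Scan a stream of ACK numbers; report every ACK number that
    reaches a triple duplicate (the 4th identical ACK in a row)."""
    retransmits = []
    last, dups = None, 0
    for ack in acks:
        if ack == last:
            dups += 1
            if dups == 3:             # triple duplicate: fast retransmit
                retransmits.append(ack)
        else:
            last, dups = ack, 0
    return retransmits

# Scenario from the slide: ACK=100 arrives four times before any timeout.
print(acks_triggering_fast_retransmit([100, 100, 100, 100, 120]))  # [100]
```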

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip Time and Timeouts

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr: SampleRTT fluctuates roughly between 100 and 350 ms; EstimatedRTT tracks it much more smoothly]

TCP Reliable Transport: TCP Round-trip Time and Timeouts

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT| (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")

TCP Reliable Transport: TCP Round-trip Time and Timeouts
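The two EWMA updates and the timeout rule fit in one function; a sketch with the slides' typical gains (the starting values below are illustrative, and updating DevRTT before EstimatedRTT is an implementation detail, not mandated by the slides):

```python
ALPHA, BETA = 0.125, 0.25   # typical EWMA gains from the slides

def update_rtt(estimated, dev, sample):
    """One estimator step: update DevRTT, then EstimatedRTT, and derive
    TimeoutInterval = EstimatedRTT + 4 * DevRTT."""
    dev = (1 - BETA) * dev + BETA * abs(sample - estimated)
    estimated = (1 - ALPHA) * estimated + ALPHA * sample
    return estimated, dev, estimated + 4 * dev

# One new sample of 120 ms against an estimate of 100 ms (dev 5 ms):
est, dev, timeout = update_rtt(100.0, 5.0, 120.0)
print(est, dev, timeout)   # 102.5 8.75 137.5
```

A sample 20 ms above the estimate nudges EstimatedRTT up by only α · 20 = 2.5 ms, while the grown deviation widens the safety margin.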


[Figure: receiver protocol stack: the application process reads from TCP socket receiver buffers; below them, TCP code and IP code receive data from the sender]

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP Reliable Transport: Flow Control

[Figure: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in from the network and out to the application process]

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
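The sender-side limit is a simple predicate over byte counters; a sketch (the byte counts below are illustrative, with rwnd at the slide's typical 4096-byte default):

```python
def can_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """Flow-control check: keep unACKed ('in-flight') data within the
    receiver's advertised window rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd

print(can_send(5000, 4000, 4096, 1000))   # 1000 in flight + 1000 fits
print(can_send(5000, 4000, 4096, 4000))   # would exceed rwnd
```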

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:
• end-end congestion control
  – no explicit feedback from network
  – congestion inferred from end-system observed loss, delay
  – approach taken by TCP
• network-assisted congestion control
  – routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connections 1 and 2 share a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) vs. time: AIMD saw tooth behavior, probing for bandwidth; additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease
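The saw tooth can be reproduced with a few lines; a sketch in units of MSS per RTT (the loss rounds are injected by hand for illustration, not detected):

```python
def aimd_trace(rounds, loss_rounds, mss=1):
    """cwnd per RTT: +1 MSS each round (additive increase),
    halved in rounds where a loss is detected (multiplicative decrease)."""
    cwnd, trace = mss, []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            cwnd = max(mss, cwnd // 2)   # cut window in half on loss
        else:
            cwnd += mss                  # probe for more bandwidth
    return trace

print(aimd_trace(8, {4}))   # [1, 2, 3, 4, 5, 2, 3, 4]
```

Each climb-and-halve cycle is one tooth of the saw tooth on the slide.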

sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd: dynamic, a function of perceived network congestion

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes: rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; in-flight data limited to cwnd]

TCP Congestion Control

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, each bounded by R; the trajectory alternates congestion avoidance (additive increase) with loss (decrease window by factor of 2), converging toward the equal bandwidth share line]

TCP Congestion Control: TCP Fairness: Why is TCP Fair?

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: Host A sends one segment, then two, then four, doubling each RTT as ACKs return from Host B]

TCP Congestion Control: Slow Start

TCP Congestion Control

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
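The two growth regimes and the ssthresh crossover can be sketched per RTT (a simplified trace with no losses; ssthresh is fixed by hand for illustration):

```python
def cwnd_trace(rounds, ssthresh, mss=1):
    """cwnd per RTT: doubles while below ssthresh (slow start),
    then grows by 1 MSS per RTT (congestion avoidance)."""
    cwnd, trace = mss, []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd < ssthresh:
            cwnd *= 2          # slow start: exponential growth
        else:
            cwnd += mss        # congestion avoidance: linear growth
    return trace

print(cwnd_trace(6, ssthresh=8))   # [1, 2, 4, 8, 9, 10]
```

The trace doubles up to the threshold, then switches to the linear probe of congestion avoidance.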

[Figure: TCP congestion-control FSM (slow start, congestion avoidance, fast recovery). Initially cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0. Slow start: on new ACK, cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; when cwnd > ssthresh, enter congestion avoidance. Congestion avoidance: on new ACK, cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed. In either state, a duplicate ACK increments dupACKcount; when dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery. Fast recovery: on duplicate ACK, cwnd = cwnd + MSS, transmit new segment(s) as allowed; on new ACK, cwnd = ssthresh, dupACKcount = 0, return to congestion avoidance. Timeout (any state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start]

TCP Congestion Control

• avg TCP throughput as function of window size, RTT
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – window oscillates between W/2 and W, so avg window size (# in-flight bytes) is 3/4 W
  – avg TCP throughput = (3/4) · W / RTT bytes/sec

TCP Throughput
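Plugging into the formula above (the W and RTT values are illustrative):

```python
def avg_tcp_throughput(w_bytes, rtt_s):
    """Average AIMD throughput: the window oscillates between W/2 and W,
    so the average window is 3/4 W, giving (3/4) * W / RTT bytes/sec."""
    return 0.75 * w_bytes / rtt_s

# e.g. loss at W = 80,000 bytes with a 100 ms RTT:
print(avg_tcp_throughput(80_000, 0.1))   # about 600,000 bytes/sec
```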

TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = (1.22 · MSS) / (RTT · √L)
→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰: a very small loss rate!
• new versions of TCP for high-speed
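The slide's numbers check out when the formula is inverted for L; a sketch using the example's parameters:

```python
from math import sqrt

MSS_BITS = 1500 * 8   # 1500-byte segments
RTT = 0.1             # 100 ms

def mathis_throughput_bps(loss_rate):
    """Mathis et al. (1997): throughput = 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * MSS_BITS / (RTT * sqrt(loss_rate))

# Loss rate required for a 10 Gbps target (invert the formula):
target = 10e9
loss = (1.22 * MSS_BITS / (RTT * target)) ** 2
print(loss)   # roughly 2.1e-10, matching the slide's L = 2*10^-10
```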

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse: What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a client connects through a switch to storage servers 1-4. A data block is striped into Server Request Units (SRUs); the client issues a synchronized read for SRUs 1-4, and once all four arrive, sends the next batch of requests]

Link idle time due to timeouts

[Figure: same synchronized read of SRUs 1-4, but server 4's response is lost; the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers involved in the transfer increases]

TCP data-driven loss recovery

[Figure: sender transmits Seq 1-5; packet 2 is lost; the receiver answers each later arrival with Ack 1; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately, and the receiver responds with Ack 5. In SANs, recovery in μsecs after loss]

TCP timeout-driven loss recovery

[Figure: sender transmits Seq 1-5; packets 2-5 are lost; the sender waits a full Retransmission Timeout (RTO) before retransmitting packet 1]

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

[Figure: side by side: data-driven recovery (triple duplicate ACK, retransmit 2, Ack 5) is super fast (μs) in SANs; timeout-driven recovery must wait out the Retransmission Timeout (RTO) and is slow (ms)]

TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications

• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Perspective

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
  – But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

Page 2: Transport Layer and Data Center TCP

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

provide logical communication between app processes running on different hosts

transport protocols run in end systems send side breaks app

messages into segments passes to network layer

rcv side reassembles segments into messages passes to app layer

more than one transport protocol available to apps Internet TCP and UDP

applicationtransportnetworkdata linkphysical

logical end-end transport

applicationtransportnetworkdata linkphysical

Transport Layer ServicesProtocols

Transport Layer ServicesProtocols

network layer logical communication between hosts

transport layer logical communication between processes relies on enhances

network layer services

12 kids in Annrsquos house sending letters to 12 kids in Billrsquos house

bull hosts = housesbull processes = kidsbull app messages = letters in

envelopesbull transport protocol = Ann

and Bill who demux to in-house siblings

bull network-layer protocol = postal service

household analogy

Transport vs Network Layer

bull reliable in-order delivery (TCP)ndash congestion control ndash flow controlndash connection setup

bull unreliable unordered delivery UDPndash no-frills extension of

ldquobest-effortrdquo IPbull services not available

ndash delay guaranteesndash bandwidth guarantees

applicationtransportnetworkdata linkphysical

applicationtransportnetworkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

logical end-end transport

Transport Layer ServicesProtocols

TCP servicebull reliable transport between

sending and receiving process

bull flow control sender wonrsquot overwhelm receiver

bull congestion control throttle sender when network overloaded

bull does not provide timing minimum throughput guarantee security

bull connection-oriented setup required between client and server processes

UDP servicebull unreliable data transfer

between sending and receiving process

bull does not provide reliability flow control congestion control timing throughput guarantee security or connection setup

Q why bother Why is there a UDP

Transport Layer ServicesProtocols

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

process

socket

use header info to deliverreceived segments to correct socket

demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)

multiplexing at sender

transport

application

physical

link

network

P2P1

transport

application

physical

link

network

P4

transport

application

physical

link

network

P3

Transport LayerSockets MultiplexingDemultiplexing

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

UDP segment format (32 bits wide):
  source port | dest port
  length | checksum
  application data (payload)

length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP Connectionless Transport: UDP Segment Header

UDP Connectionless Transport

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …

UDP Checksum
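The one's-complement sum above can be sketched in a few lines of Python. This is an illustrative model only — the real UDP checksum also covers a pseudo-header (IP addresses, protocol, length), which is omitted here:

```python
import struct

def ones_complement_checksum(data: bytes) -> int:
    """16-bit one's-complement sum over the data, then complemented."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length data
    total = 0
    for (word,) in struct.iter_unpack("!H", data):  # 16-bit big-endian words
        total += word
        total = (total & 0xFFFF) + (total >> 16)    # end-around carry
    return ~total & 0xFFFF                          # one's complement of sum

segment = b"hello world!"                # stand-in for header + payload
csum = ones_complement_checksum(segment)

# receiver check: checksumming the data plus the checksum field yields 0,
# because the words then sum (with end-around carry) to all ones
assert ones_complement_checksum(segment + struct.pack("!H", csum)) == 0
```

Flipping a single bit in the segment changes the sum, so the receiver's recomputation no longer yields zero and the error is detected; some multi-bit errors can still cancel out, which is why the slides note "maybe errors nonetheless."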

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
• Congestion control
• Data Center TCP
  – Incast Problem

• important in application, transport, link layers: top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport


[Figure: reliable data transfer interfaces — send side and receive side]

rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
udt_send(): called by rdt to transfer packet over unreliable channel to receiver
rdt_rcv(): called when packet arrives on rcv-side of channel
deliver_data(): called by rdt to deliver data to upper layer

Principles of Reliable Transport

TCP: Transmission Control Protocol — RFCs 793, 1122, 1323, 2018, 2581

• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport

TCP segment structure (32 bits wide):
  source port | dest port
  sequence number — counting by bytes of data (not segments!)
  acknowledgement number
  head len | not used | U A P R S F | receive window — # bytes rcvr willing to accept
  checksum (Internet checksum, as in UDP) | urg data pointer
  options (variable length)
  application data (variable length)

flag bits — URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)

TCP Reliable Transport: TCP Segment Structure

sequence numbers: byte-stream "number" of first byte in segment's data
acknowledgements: seq # of next byte expected from other side; cumulative ACK

Q: how does the receiver handle out-of-order segments?
A: TCP spec doesn't say — up to implementor

sender sequence number space (window size N): [sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port, dest port, sequence number, acknowledgement number, checksum, rwnd, urg pointer]

TCP Reliable Transport: TCP Sequence numbers and Acks

simple telnet scenario (Host A ↔ Host B):
1. User types 'C': A sends Seq=42, ACK=79, data = 'C'
2. B ACKs receipt of 'C' and echoes back 'C': Seq=79, ACK=43, data = 'C'
3. A ACKs receipt of the echoed 'C': Seq=43, ACK=80

TCP Reliable Transport: TCP Sequence numbers and Acks


before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters

client: Socket clientSocket = new Socket(hostname, port number);
server: Socket connectionSocket = welcomeSocket.accept();

both sides then hold connection state: ESTAB, and connection variables: seq #s (client-to-server, server-to-client), rcvBuffer size at server, client

TCP Reliable Transport: Connection Management: TCP 3-way handshake

client state: LISTEN → SYNSENT → ESTAB;  server state: LISTEN → SYN RCVD → ESTAB

1. client chooses init seq num x, sends TCP SYN msg: SYNbit=1, Seq=x
2. server chooses init seq num y, sends TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. client: received SYNACK(x) indicates server is live; sends ACK for SYNACK: ACKbit=1, ACKnum=y+1 (this segment may contain client-to-server data)
4. server: received ACK(y) indicates client is live

TCP Reliable Transport: Connection Management: TCP 3-way handshake

handshake FSM:
• client: closed → [Socket clientSocket = new Socket(hostname, port number); send SYN(seq=x)] → SYN sent → [rcv SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)] → ESTAB
• server: closed → listen [Socket connectionSocket = welcomeSocket.accept()] → [rcv SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client] → SYN rcvd → [rcv ACK(ACKnum=y+1)] → ESTAB

TCP Reliable Transport: Connection Management: TCP 3-way handshake
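The same exchange is what the kernel performs beneath the socket calls on the slides: connect() sends the SYN and blocks until the SYNACK arrives and is ACKed, while accept() completes the server side. A minimal localhost sketch (Python rather than the slides' Java, purely for illustration):

```python
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: OS picks an ephemeral port
server.listen(1)                     # server enters LISTEN
addr = server.getsockname()

def serve():
    conn, _ = server.accept()        # handshake completes: SYN RCVD -> ESTAB
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(addr)                 # kernel: send SYN, await SYNACK, send ACK
assert client.getpeername() == addr  # connection is ESTAB on the client side
t.join()
client.close()
server.close()
```

Neither side's application code sees SYN or SYNACK segments; the three-way handshake happens entirely inside the transport layer.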

closing a connection:
• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management: Closing connection

client state: ESTAB → FIN_WAIT_1 → FIN_WAIT_2 → TIMED_WAIT → CLOSED;  server state: ESTAB → CLOSE_WAIT → LAST_ACK → CLOSED

1. client calls clientSocket.close(): sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send, but can receive data)
2. server ACKs: ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client ACKs: ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2× max segment lifetime), then CLOSED; server, on receiving the ACK, enters CLOSED

TCP Reliable Transport: Connection Management: Closing connection


TCP sender events:

data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout; restart timer

ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport

lost ACK scenario: A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X); A times out and resends Seq=92, 8 bytes of data; B ACKs 100 again; SendBase=100

premature timeout: A sends Seq=92, 8 bytes then Seq=100, 20 bytes; B ACKs 100 then 120, but A's timer fires first and A resends Seq=92, 8 bytes; SendBase advances 92 → 100 → 120; B replies with cumulative ACK=120

TCP Reliable Transport: TCP Retransmission Scenarios

cumulative ACK scenario: A sends Seq=92, 8 bytes then Seq=100, 20 bytes; ACK=100 is lost (X), but B's cumulative ACK=120 arrives, so A need not retransmit and next sends Seq=120, 15 bytes of data

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, 2581]

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

Reliable Transport

TCP fast retransmit:
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
• if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #: likely that unacked segment was lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit

fast retransmit after sender receipt of triple duplicate ACK: A sends Seq=92, 8 bytes then Seq=100, 20 bytes; the Seq=100 segment is lost (X); B sends ACK=100 repeatedly as later data arrives; on the 3rd duplicate ACK, A resends Seq=100, 20 bytes of data without waiting for the timeout

TCP Reliable Transport: TCP Fast Retransmit
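The sender-side trigger can be sketched as a duplicate-ACK counter over the stream of cumulative ACK numbers (timeouts and congestion-window changes are omitted; the ACK values are hypothetical):

```python
def fast_retransmit(acks):
    """Return seq #s retransmitted on triple duplicate ACKs."""
    retransmits = []
    last_ack, dups = None, 0
    for ack in acks:
        if ack == last_ack:
            dups += 1
            if dups == 3:              # 3rd duplicate: resend from this seq #
                retransmits.append(ack)
        else:
            last_ack, dups = ack, 0    # new cumulative ACK resets the count
    return retransmits

# ACK stream from the slide's scenario: segment 100 lost, each later
# arrival re-ACKs 100; the retransmission finally elicits ACK=120
print(fast_retransmit([100, 100, 100, 100, 120]))   # [100]
```

Note the count starts at the first *duplicate*: four ACKs carrying the same number means three duplicates, which is exactly the slide's trigger.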

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt; ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Roundtrip time and timeouts

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr — SampleRTT (milliseconds, roughly 100–350 ms) fluctuates over time (seconds); EstimatedRTT tracks it more smoothly]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average: influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Roundtrip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin": large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|   (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
                  (estimated RTT + "safety margin")

TCP Reliable Transport: TCP Roundtrip time and timeouts
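The two EWMA updates and the resulting timeout can be traced directly; the starting values and RTT samples below are hypothetical (in milliseconds), chosen so a spike shows DevRTT widening the safety margin:

```python
ALPHA, BETA = 0.125, 0.25          # typical values from the slides

def update_rtt(estimated, dev, sample):
    """One update of DevRTT, EstimatedRTT, and the timeout interval."""
    dev = (1 - BETA) * dev + BETA * abs(sample - estimated)
    estimated = (1 - ALPHA) * estimated + ALPHA * sample
    timeout = estimated + 4 * dev  # EstimatedRTT plus safety margin
    return estimated, dev, timeout

est, dev = 100.0, 5.0              # hypothetical starting state, in ms
for sample in (110.0, 95.0, 240.0):
    est, dev, timeout = update_rtt(est, dev, sample)
    print(f"sample={sample:6.1f}  EstimatedRTT={est:7.2f}  Timeout={timeout:7.2f}")
```

After the 240 ms spike, EstimatedRTT moves only modestly (to about 118 ms) but DevRTT jumps, pushing the timeout near 276 ms — the EWMA smooths the estimate while the deviation term reacts quickly to variability.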


flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack — data from sender flows through IP code and TCP code into the TCP socket receiver buffers, then up to the application process]

TCP Reliable Transport: Flow Control

receiver-side buffering: RcvBuffer holds buffered data; rwnd = free buffer space; TCP segment payloads enter, the application process drains
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
• single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
• explicit rate for sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness

AIMD: additive increase, multiplicative decrease
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs:
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) over time — AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD
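The sawtooth can be traced with a toy model of the AIMD rule — cwnd in units of MSS, one step per RTT, with the loss rounds supplied as hypothetical inputs rather than simulated from a network:

```python
def aimd_trace(rounds, loss_rounds, cwnd=1.0):
    """cwnd (in MSS) after each RTT round: +1 MSS per RTT, halved on loss."""
    trace = []
    for rtt in range(rounds):
        if rtt in loss_rounds:
            cwnd = max(cwnd / 2.0, 1.0)   # multiplicative decrease
        else:
            cwnd += 1.0                   # additive increase: 1 MSS per RTT
        trace.append(cwnd)
    return trace

trace = aimd_trace(10, loss_rounds={5, 8})
print(trace)   # sawtooth: ramps to 6, halves to 3, ramps to 5, halves to 2.5, ...
```

The linear climb and halving on loss are exactly what produces the sawtooth in the figure, and (as the next slides argue) what drives two competing flows toward an equal share.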

sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
• sender sequence number space: [last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent]

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:
  rate ≈ cwnd / RTT bytes/sec

TCP Congestion Control

why is TCP fair? two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, both bounded by R — congestion avoidance: additive increase; loss: decrease window by factor of 2; the trajectory converges toward the equal bandwidth share line]

TCP Congestion Control: TCP Fairness: Why is TCP Fair?

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts; web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness

TCP Slow Start: when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: Host A sends one segment, then two, then four to Host B, one batch per RTT]

TCP Congestion Control: Slow Start
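The exponential ramp-up, and the later switch to linear growth once cwnd reaches the threshold introduced on the next slide, can be sketched per RTT round; the ssthresh value here is hypothetical:

```python
def cwnd_per_rtt(ssthresh, rounds):
    """cwnd in MSS at the start of each RTT: doubles below ssthresh, then +1."""
    cwnd, trace = 1, []
    for _ in range(rounds):
        trace.append(cwnd)
        # slow start doubles per RTT; congestion avoidance adds 1 MSS per RTT
        cwnd = cwnd * 2 if cwnd < ssthresh else cwnd + 1
    return trace

trace = cwnd_per_rtt(ssthresh=8, rounds=7)
print(trace)   # [1, 2, 4, 8, 9, 10, 11]
```

Doubling per RTT corresponds to "increment cwnd for every ACK received": each ACKed segment adds 1 MSS, so a full window of ACKs doubles the window.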

TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Switching from Slow Start to Congestion Avoidance (CA):
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP congestion control FSM:

slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start

TCP Congestion Control

TCP Throughput
• avg TCP thruput as function of window size, RTT (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs; window oscillates between W/2 and W
  – avg window size (# in-flight bytes) is 3/4 W
  – avg thruput is 3/4 W per RTT:
    avg TCP thruput = (3/4) · W / RTT bytes/sec

TCP over "long, fat pipes":
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ — a very small loss rate!
• → new versions of TCP for high-speed
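The slide's loss-rate figure can be checked by inverting the Mathis formula; with throughput expressed in bits/sec and MSS in bytes, a factor of 8 appears:

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Steady-state TCP throughput 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * sqrt(loss))

# the slide's numbers: 1500-byte segments, 100 ms RTT, 10 Gbps target
mss, rtt, target = 1500, 0.100, 10e9

# solve the formula for L given the target throughput
loss = (1.22 * mss * 8 / (rtt * target)) ** 2
print(f"required loss rate L ~ {loss:.1e}")   # ~2e-10, matching the slide

# sanity check: plugging L back in recovers the target rate
assert abs(mathis_throughput_bps(mss, rtt, loss) - target) / target < 1e-9
```

At a realistic loss rate even orders of magnitude higher, standard TCP cannot fill such a pipe, which is the motivation for high-speed TCP variants.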

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse: what happens when TCP is "too friendly"?
E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer; SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: client ↔ switch ↔ storage servers 1–4; a data block is divided into Server Request Units (SRUs); the client performs a synchronized read of SRUs 1, 2, 3, 4, and only after all responses arrive sends the next batch of requests]

Link idle time due to timeouts: the client issues a synchronized read for SRUs 1–4; server 4's response is lost, and the link is idle until that server experiences a timeout; only then can the block complete and the next batch of requests go out

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput vs. number of servers — the curve drops sharply ("Collapse") as server count grows]

TCP data-driven loss recovery: sender transmits segments 1–5; segment 2 is lost; each subsequent arrival triggers a duplicate Ack 1; on the 3rd duplicate ACK for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately, and the receiver then sends Ack 5. In SANs: recovery in µsecs after loss.

TCP timeout-driven loss recovery: sender transmits segments 1–5, but only segment 1 is ACKed; with too few duplicate ACKs to trigger fast retransmit, the sender must wait out the Retransmission Timeout (RTO) before resending. Timeouts are expensive (msecs to recover after loss).

TCP Loss recovery comparison:
• data-driven recovery (triple duplicate ACKs → immediate retransmit of 2 → Ack 5) is super fast (µs) in SANs
• timeout-driven recovery (waiting out the Retransmission Timeout (RTO)) is slow (ms)

TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limits in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective:
• principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time
• Project Proposal: due in one week; meet with groups, TA, and professor
• Lab1: single-threaded TCP proxy; due in one week, next Friday
• No required reading and review due, but review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule


• provide logical communication between app processes running on different hosts
• transport protocols run in end systems:
  – send side: breaks app messages into segments, passes to network layer
  – rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps: Internet: TCP and UDP

[Figure: two end hosts, each running the application / transport / network / data link / physical stack, with logical end-end transport between them]

Transport Layer Services/Protocols

network layer: logical communication between hosts
transport layer: logical communication between processes; relies on, enhances network layer services

Transport Layer Services/Protocols

household analogy: 12 kids in Ann's house sending letters to 12 kids in Bill's house
• hosts = houses
• processes = kids
• app messages = letters in envelopes
• transport protocol = Ann and Bill, who demux to in-house siblings
• network-layer protocol = postal service

Transport vs. Network Layer

• reliable, in-order delivery (TCP):
  – congestion control
  – flow control
  – connection setup
• unreliable, unordered delivery: UDP: no-frills extension of "best-effort" IP
• services not available:
  – delay guarantees
  – bandwidth guarantees

[Figure: end-to-end path — intermediate routers run only network / data link / physical; end hosts run the full stack, providing logical end-end transport]

Transport Layer Services/Protocols

TCP servicebull reliable transport between

sending and receiving process

bull flow control sender wonrsquot overwhelm receiver

bull congestion control throttle sender when network overloaded

bull does not provide timing minimum throughput guarantee security

bull connection-oriented setup required between client and server processes

UDP servicebull unreliable data transfer

between sending and receiving process

bull does not provide reliability flow control congestion control timing throughput guarantee security or connection setup

Q why bother Why is there a UDP

Transport Layer ServicesProtocols

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

process

socket

use header info to deliverreceived segments to correct socket

demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)

multiplexing at sender

transport

application

physical

link

network

P2P1

transport

application

physical

link

network

P4

transport

application

physical

link

network

P3

Transport LayerSockets MultiplexingDemultiplexing

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

source port dest port

32 bits

applicationdata (payload)

UDP segment format

length checksum

length in bytes of UDP segment

including header

no connection establishment (which can add delay)

simple no connection state at sender receiver

small header size no congestion control UDP

can blast away as fast as desired

why is there a UDP

UDP Connectionless TransportUDP Segment Header

UDP Connectionless Transport

senderbull treat segment contents

including header fields as sequence of 16-bit integers

bull checksum addition (onersquos complement sum) of segment contents

bull sender puts checksum value into UDP checksum field

receiverbull compute checksum of received

segmentbull check if computed checksum

equals checksum field value

ndash NO - error detectedndash YES - no error detected

But maybe errors nonetheless More later hellip

Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment

UDP Checksum

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

important in application transport link layers top-10 list of important networking topics

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

sendside

receiveside

rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer

udt_send() called by rdtto transfer packet over unreliable channel to receiver

rdt_rcv() called when packet arrives on rcv-side of channel

deliver_data() called by rdt to deliver data to upper

Principles of Reliable Transport

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

TCP Reliable Transport: Connection Management (TCP 3-way handshake)
before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters
each side then holds connection state: ESTAB, and connection variables: seq # (client-to-server, server-to-client), rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

client state: LISTEN → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
4. server: received ACK(y) indicates client is live; enter ESTAB

TCP Reliable Transport: Connection Management (TCP 3-way handshake)
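In practice the OS carries out the three-way handshake; as a sketch in Python (rather than the slides' Java), `connect()` sends the SYN and blocks until the SYNACK/ACK exchange completes, while `accept()` returns a socket once the server side reaches ESTAB:

```python
# Minimal loopback sketch: the kernel performs SYN -> SYNACK -> ACK
# beneath connect()/accept(), mirroring the slide's Java Socket calls.
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # ephemeral port
server.listen(1)                       # server state: LISTEN
port = server.getsockname()[1]

def serve():
    conn, _ = server.accept()          # returns once handshake completes
    conn.sendall(b"ESTAB")
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))    # SYN sent; blocks until ESTAB
data = client.recv(16)                 # b"ESTAB" from the server
client.close(); t.join(); server.close()
print(data)
```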

3-way handshake FSM:
client: CLOSED → (Socket clientSocket = new Socket("hostname", "port number"): send SYN(seq=x)) → SYNSENT → (rcv SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)) → ESTAB
server: (Socket connectionSocket = welcomeSocket.accept()) → LISTEN → (rcv SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client) → SYN RCVD → (rcv ACK(ACKnum=y+1)) → ESTAB

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

client, server each close their side of connection:
• send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management (Closing connection)

client state: ESTAB → FIN_WAIT_1 → FIN_WAIT_2 → TIMED_WAIT → CLOSED; server state: ESTAB → CLOSE_WAIT → LAST_ACK → CLOSED
1. clientSocket.close(): client sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send, but can receive data)
2. server replies ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server enters CLOSED on receiving the ACK

TCP Reliable Transport: Connection Management (Closing connection)


TCP sender events:
• data rcvd from app: create segment with seq #; seq # is byte-stream number of first data byte in segment; start timer if not already running (think of timer as for oldest unacked segment; expiration interval: TimeOutInterval)
• timeout: retransmit segment that caused timeout; restart timer
• ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport

TCP Reliable Transport: TCP Retransmission Scenarios
lost ACK scenario (Host A → Host B): SendBase=92; Seq=92, 8 bytes of data sent; ACK=100 lost (X); timeout; resend Seq=92, 8 bytes of data; ACK=100 arrives; SendBase=100
premature timeout: Seq=92, 8 bytes of data and Seq=100, 20 bytes of data sent; timeout fires before ACK=100 arrives, so Seq=92, 8 bytes of data is resent; cumulative ACK=120 then covers both segments; SendBase moves 100 → 120

cumulative ACK scenario (Host A → Host B): Seq=92, 8 bytes of data and Seq=100, 20 bytes of data sent; ACK=100 lost (X), but cumulative ACK=120 arrives before the timeout and covers both segments, so nothing is resent; Host A next sends Seq=120, 15 bytes of data

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, RFC 2581]
event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
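The four rules above can be sketched as receiver logic. This simplified version is mine (it does not buffer out-of-order data, so the "fills gap" case simply re-ACKs the next expected byte); it returns the ACK to send now, or None when the ACK is delayed:

```python
# Sketch of RFC 1122/2581-style ACK generation at the receiver.
# state["expected"]: next in-order byte; state["pending"]: delayed ACK flag.
def on_segment(state, seq, length):
    """Return the cumulative ACK to send immediately, or None to delay."""
    if seq == state["expected"]:
        state["expected"] += length
        if state["pending"]:             # second in-order segment:
            state["pending"] = False     # one cumulative ACK for both
            return state["expected"]
        state["pending"] = True          # delayed ACK (wait up to 500 ms)
        return None
    # out-of-order segment: immediately send (duplicate) ACK for expected byte
    state["pending"] = False
    return state["expected"]

s = {"expected": 92, "pending": False}
print(on_segment(s, 92, 8))    # None (delayed ACK)
print(on_segment(s, 100, 20))  # 120  (single cumulative ACK for both)
print(on_segment(s, 150, 10))  # 120  (duplicate ACK: gap detected)
```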

TCP Reliable Transport: TCP Fast Retransmit
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if segment is lost, there will likely be many duplicate ACKs
• TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #; likely that unacked segment was lost, so don’t wait for timeout

fast retransmit after sender receipt of triple duplicate ACK (Host A → Host B): Seq=92, 8 bytes of data and Seq=100, 20 bytes of data sent; second segment lost (X); each subsequent arrival triggers another ACK=100; after the third duplicate ACK=100, Host A resends Seq=100, 20 bytes of data before the timeout expires

TCP Reliable Transport: TCP Fast Retransmit
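A sender-side sketch of the rule above (my own, with illustrative names): count ACKs that fail to advance the window, and fire a fast retransmit on the third duplicate rather than waiting for the timer:

```python
# Sketch: fast retransmit on a "triple duplicate ACK".
def sender(acks, base):
    """base: lowest unACKed seq #. Returns fast-retransmit events."""
    dup, events = 0, []
    for ack in acks:
        if ack > base:                 # new ACK advances the window
            base, dup = ack, 0
        else:
            dup += 1                   # duplicate ACK
            if dup == 3:               # triple duplicate: resend now
                events.append(("fast_retransmit", ack))
    return events

# ACK=100 arriving four times, as in the diagram above:
print(sender([100, 100, 100, 100], base=100))
# -> [('fast_retransmit', 100)]
```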

TCP Reliable Transport: TCP Round-trip Time and Timeouts
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want estimated RTT “smoother”
– average several recent measurements, not just current SampleRTT

[figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr – SampleRTT and EstimatedRTT; RTT (milliseconds) vs time (seconds)]

EstimatedRTT = (1 – α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average; influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Round-trip Time and Timeouts

• timeout interval: EstimatedRTT plus “safety margin”
– large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 – β) · DevRTT + β · |SampleRTT – EstimatedRTT|
(typically, β = 0.25)
TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus “safety margin”)

TCP Reliable Transport: TCP Round-trip Time and Timeouts
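The two estimators above can be plugged into a few lines of code. A sketch (note that RFC 6298 updates the deviation term before the smoothed RTT; this follows the slide's ordering instead, and the starting values are arbitrary):

```python
# Sketch of the EWMA timeout computation; alpha = 0.125, beta = 0.25.
def update(est, dev, sample, alpha=0.125, beta=0.25):
    est = (1 - alpha) * est + alpha * sample            # EstimatedRTT
    dev = (1 - beta) * dev + beta * abs(sample - est)   # DevRTT
    return est, dev, est + 4 * dev                      # TimeoutInterval

est, dev = 100.0, 5.0                   # milliseconds (arbitrary start)
for sample in (110.0, 120.0, 90.0):
    est, dev, timeout = update(est, dev, sample)
print(round(est, 1), round(timeout, 1))
```

Note how the timeout tracks not just the mean RTT but its variability: the jittery samples widen the 4 · DevRTT safety margin well beyond the estimated RTT itself.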


TCP Reliable Transport: Flow Control
flow control: receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast
receiver protocol stack: segments from sender pass up through IP code and TCP code (in the OS) into TCP socket receiver buffers, then to the application process
application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the network, data flows out to the application process
• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
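The sender-side rule reduces to one inequality: keep in-flight bytes ≤ rwnd. A sketch (names are illustrative, not from any real stack):

```python
# Sketch: how many new bytes a flow-controlled sender may transmit now.
def sendable(last_byte_sent, last_byte_acked, rwnd, backlog):
    """backlog: bytes the app has queued. Keeps in-flight <= rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return min(backlog, max(0, rwnd - in_flight))

print(sendable(5000, 3000, rwnd=4096, backlog=10000))  # 2096
print(sendable(7096, 3000, rwnd=4096, backlog=10000))  # 0 (window full)
```

When the advertised window is full the sender stalls until ACKs advance `last_byte_acked` or the receiver advertises a larger rwnd.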

Goals for Today
• Transport Layer
– Abstraction, services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem

Principles of Congestion Control
congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)

two broad approaches towards congestion control:
• end-end congestion control:
– no explicit feedback from network
– congestion inferred from end-system observed loss, delay
– approach taken by TCP
• network-assisted congestion control:
– routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at

Principles of Congestion Control

TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness – Why is TCP Fair? AIMD: additive increase, multiplicative decrease
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[figure: TCP sender congestion window size (cwnd) vs time – AIMD saw tooth behavior: probing for bandwidth; additively increase window size … until loss occurs (then cut window in half)]
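The saw tooth above can be reproduced in a few lines. A sketch in MSS units, where "loss" is simulated as soon as cwnd reaches a fixed capacity (an assumption for illustration only; real loss depends on queue overflow):

```python
# Sketch of the AIMD saw tooth: +1 MSS per RTT, halve on loss.
def aimd(rounds, capacity=8, cwnd=1):
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        cwnd = cwnd / 2 if cwnd >= capacity else cwnd + 1
    return trace

print(aimd(12))   # climbs to 8, halves to 4, climbs again: the saw tooth
```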

TCP Congestion Control
• sender limits transmission: LastByteSent – LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes; rate ≈ cwnd / RTT bytes/sec
[figure: sender sequence number space – last byte ACKed; sent, not-yet ACKed (“in-flight”); last byte sent; window of cwnd bytes]

TCP Congestion Control: TCP Fairness – Why is TCP Fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[figure: Connection 1 throughput vs Connection 2 throughput, each axis up to full bandwidth R; repeated congestion-avoidance (additive increase) and loss (decrease window by factor of 2) steps converge toward the equal bandwidth share line]
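The convergence argument in the figure can be checked numerically. A sketch (an idealization: both flows see the same RTT and lose simultaneously whenever their sum exceeds the bottleneck capacity R):

```python
# Sketch: AIMD drives two competing flows toward an equal share.
def converge(x1, x2, R=100, rounds=200):
    for _ in range(rounds):
        if x1 + x2 > R:
            x1, x2 = x1 / 2, x2 / 2   # multiplicative decrease
        else:
            x1, x2 = x1 + 1, x2 + 1   # additive increase
    return x1, x2

x1, x2 = converge(5, 80)              # start far from fair
print(abs(x1 - x2))                   # gap shrinks toward 0
```

Each multiplicative decrease halves the gap between the flows while additive increase leaves it unchanged, so the rates approach R/2 each.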

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts; web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness

TCP Congestion Control: Slow Start
• when connection begins, increase rate exponentially until first loss event:
– initially cwnd = 1 MSS
– double cwnd every RTT
– done by incrementing cwnd for every ACK received
• summary: initial rate is slow, but ramps up exponentially fast
[figure: Host A → Host B – one segment in the first RTT, then two segments, then four segments]

TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

[FSM: TCP congestion control]
slow start:
• init: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• cwnd > ssthresh: → congestion avoidance
congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
from any state, timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

TCP Congestion Control
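The FSM above can be sketched as a small class. This is a simplification (MSS fixed at 1 for readability, fast recovery collapsed into its exit action, and names are mine), not a faithful Reno implementation:

```python
# Sketch of the slow-start / congestion-avoidance transitions (Reno flavor).
class Reno:
    def __init__(self):
        self.cwnd, self.ssthresh, self.state = 1, 64, "slow_start"

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += 1                 # exponential growth per ACK
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1 / self.cwnd     # ~ +1 MSS per RTT

    def on_triple_dup_ack(self):           # Reno: halve, not back to 1
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.ssthresh + 3
        self.state = "congestion_avoidance"

    def on_timeout(self):                  # back to slow start
        self.ssthresh = self.cwnd / 2
        self.cwnd, self.state = 1, "slow_start"

r = Reno()
for _ in range(70):
    r.on_new_ack()
print(r.state, r.cwnd >= 64)   # congestion_avoidance True
```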

TCP Throughput
• avg. TCP throughput as function of window size, RTT?
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– window oscillates between W/2 and W
– avg. window size (# in-flight bytes) is 3/4 W
– avg. throughput is 3/4 W per RTT
avg TCP throughput = (3/4) · W / RTT bytes/sec

TCP over “long, fat pipes”
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
TCP throughput = 1.22 · MSS / (RTT · √L)
→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ – a very small loss rate!
• new versions of TCP needed for high-speed
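Plugging the slide's numbers into the formulas above (a sketch of the arithmetic, with MSS expressed in bits):

```python
# 1500-byte segments, 100 ms RTT, 10 Gbps target (the slide's example).
from math import sqrt

MSS = 1500 * 8          # bits per segment
RTT = 0.100             # seconds
target = 10e9           # bits/sec

# window needed: throughput = W * MSS / RTT  =>  W = target * RTT / MSS
W = target * RTT / MSS
print(round(W))                          # ~83,333 in-flight segments

# Mathis et al.: throughput = 1.22 * MSS / (RTT * sqrt(L)); solve for L
L = (1.22 * MSS / (RTT * target)) ** 2
print("%.1e" % L)                        # ~2e-10 loss rate
```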

Goals for Today
• Transport Layer
– Abstraction, services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem

Slides used judiciously from “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”, A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008

TCP Throughput Collapse: what happens when TCP is “too friendly”?
E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
– SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems
[figure: client connected through a switch to storage servers 1–4; a data block is striped across servers in Server Request Units (SRUs); the client performs a synchronized read, and once responses R1–R4 arrive, sends the next batch of requests]

Link idle time due to timeouts
[figure: same client/switch/server setup, synchronized read of SRUs 1–4; response 4 is lost, so the link is idle until that server experiences a timeout]

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[graph: goodput collapses as the number of servers increases]

TCP data-driven loss recovery
[diagram: sender transmits segments 1–5; segment 2 is lost; receiver returns Ack 1 for segments 3, 4, 5 – 3 duplicate ACKs for 1 (packet 2 is probably lost); sender retransmits packet 2 immediately, then receives Ack 5]
In SANs, data-driven recovery happens in µsecs after loss.

TCP timeout-driven loss recovery
[diagram: sender transmits segments 1–5; all but segment 1 are lost; only Ack 1 returns, so the sender must wait out the Retransmission Timeout (RTO) before resending]
• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison
[diagrams: data-driven recovery – duplicate Ack 1s trigger immediate retransmit of 2, then Ack 5 – vs timeout-driven recovery, where the sender idles for the Retransmission Timeout (RTO)]
• Data-driven recovery is super fast (µs) in SANs
• Timeout-driven recovery is slow (ms)

TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
– Increase buffer size (costly)
– Reduce RTOmin (unsafe)
– Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
– Limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective
• principles behind transport layer services:
– multiplexing, demultiplexing
– reliable data transfer
– flow control
– congestion control
• instantiation, implementation in the Internet:
– UDP
– TCP
Next time:
• Network Layer
• leaving the network “edge” (application, transport layers)
• into the network “core”

Before Next time
• Project Proposal
– due in one week
– Meet with groups, TA, and professor
• Lab1
– Single threaded TCP proxy
– Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
– We will also briefly discuss data center topologies
• Check website for updated schedule


Transport Layer ServicesProtocols

network layer logical communication between hosts

transport layer logical communication between processes relies on enhances

network layer services

12 kids in Annrsquos house sending letters to 12 kids in Billrsquos house

bull hosts = housesbull processes = kidsbull app messages = letters in

envelopesbull transport protocol = Ann

and Bill who demux to in-house siblings

bull network-layer protocol = postal service

household analogy

Transport vs Network Layer

bull reliable in-order delivery (TCP)ndash congestion control ndash flow controlndash connection setup

bull unreliable unordered delivery UDPndash no-frills extension of

ldquobest-effortrdquo IPbull services not available

ndash delay guaranteesndash bandwidth guarantees

applicationtransportnetworkdata linkphysical

applicationtransportnetworkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

networkdata linkphysical

logical end-end transport

Transport Layer ServicesProtocols

TCP servicebull reliable transport between

sending and receiving process

bull flow control sender wonrsquot overwhelm receiver

bull congestion control throttle sender when network overloaded

bull does not provide timing minimum throughput guarantee security

bull connection-oriented setup required between client and server processes

UDP servicebull unreliable data transfer

between sending and receiving process

bull does not provide reliability flow control congestion control timing throughput guarantee security or connection setup

Q why bother Why is there a UDP

Transport Layer ServicesProtocols

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

process

socket

use header info to deliverreceived segments to correct socket

demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)

multiplexing at sender

transport

application

physical

link

network

P2P1

transport

application

physical

link

network

P4

transport

application

physical

link

network

P3

Transport LayerSockets MultiplexingDemultiplexing

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

source port dest port

32 bits

applicationdata (payload)

UDP segment format

length checksum

length in bytes of UDP segment

including header

no connection establishment (which can add delay)

simple no connection state at sender receiver

small header size no congestion control UDP

can blast away as fast as desired

why is there a UDP

UDP Connectionless TransportUDP Segment Header

UDP Connectionless Transport

senderbull treat segment contents

including header fields as sequence of 16-bit integers

bull checksum addition (onersquos complement sum) of segment contents

bull sender puts checksum value into UDP checksum field

receiverbull compute checksum of received

segmentbull check if computed checksum

equals checksum field value

ndash NO - error detectedndash YES - no error detected

But maybe errors nonetheless More later hellip

Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment

UDP Checksum

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

important in application transport link layers top-10 list of important networking topics

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

sendside

receiveside

rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer

udt_send() called by rdtto transfer packet over unreliable channel to receiver

rdt_rcv() called when packet arrives on rcv-side of channel

deliver_data() called by rdt to deliver data to upper

Principles of Reliable Transport

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing

to establish connection)bull agree on connection parameters

connection state ESTABconnection variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

connection state ESTABconnection Variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

Socket clientSocket = newSocket(hostnameport

number)

Socket connectionSocket = welcomeSocketaccept()

Connection Management TCP 3-way handshake

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

[Plot: RTT (milliseconds) from gaia.cs.umass.edu to fantasia.eurecom.fr over time (seconds); SampleRTT fluctuates between roughly 100 and 350 ms while EstimatedRTT tracks it more smoothly.]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
- exponentially weighted moving average
- influence of past sample decreases exponentially fast
- typical value: α = 0.125

TCP Reliable Transport: TCP Round-trip Time and Timeouts

- timeout interval: EstimatedRTT plus "safety margin": large variation in EstimatedRTT -> larger safety margin
- estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|
(typically β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")

TCP Reliable Transport: TCP Round-trip Time and Timeouts
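The two EWMA updates and the timeout rule fit in a few lines. The starting values below are assumptions for illustration; RFC 6298 also specifies initialization and a minimum RTO, which this sketch omits.

```python
ALPHA, BETA = 0.125, 0.25   # typical values

def update_rtt(est, dev, sample):
    """One EWMA update of EstimatedRTT/DevRTT, returning the new
    TimeoutInterval as well. DevRTT is updated with the old estimate,
    as in RFC 6298."""
    dev = (1 - BETA) * dev + BETA * abs(sample - est)
    est = (1 - ALPHA) * est + ALPHA * sample
    return est, dev, est + 4 * dev   # TimeoutInterval

# assumed starting values, in milliseconds
est, dev = 100.0, 10.0
est, dev, timeout = update_rtt(est, dev, sample=140.0)
# est     = 0.875*100 + 0.125*140        = 105.0 ms
# dev     = 0.75*10   + 0.25*|140 - 100| = 17.5 ms
# timeout = 105 + 4*17.5                 = 175.0 ms
```

A single 40 ms spike thus widens the safety margin far more than it moves the estimate itself.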

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)

- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set window size
- full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
- flow controlled: sender will not overwhelm receiver

TCP Reliable Transport

TCP Reliable Transport: Flow Control

[Figure: receiver protocol stack: the application process reads from TCP socket receiver buffers; TCP code and IP code sit below, inside the OS; segments arrive from the sender.]

- application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
- flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver-side buffering: RcvBuffer holds buffered data headed to the application process; rwnd is the remaining free buffer space for incoming TCP segment payloads.]

- receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits amount of unACKed ("in-flight") data to receiver's rwnd value
- guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
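Sender-side enforcement of rwnd reduces to simple arithmetic, sketched below. The function name is illustrative; a real sender would limit in-flight data to the minimum of rwnd and the congestion window cwnd introduced later.

```python
def sendable_bytes(last_byte_sent, last_byte_acked, rwnd):
    """Flow-control sketch: keep unACKed ("in-flight") data no larger
    than the receiver's advertised window rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# receiver advertised rwnd = 4096 bytes; 3000 bytes already in flight:
budget = sendable_bytes(last_byte_sent=8000, last_byte_acked=5000, rwnd=4096)
# budget = 4096 - 3000 = 1096 bytes may still be sent
```

Once in-flight data reaches rwnd, the budget drops to zero and the sender must wait for ACKs before transmitting more.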

Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
  - Congestion control
- Data Center TCP
  - Incast Problem

Principles of Congestion Control

congestion:
- informally: "too many sources sending too much data too fast for network to handle"
- different from flow control!
- manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:
- end-end congestion control:
  - no explicit feedback from network
  - congestion inferred from end-system observed loss, delay
  - approach taken by TCP
- network-assisted congestion control:
  - routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP Congestion Control: TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth: TCP sender congestion window size (cwnd) over time; additively increase window size … until loss occurs (then cut window in half), probing for bandwidth.]
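The sawtooth above can be reproduced with a toy model. This is a sketch only: loss rounds are given explicitly rather than arising from a simulated network, and cwnd is counted in MSS units.

```python
def aimd(rounds, loss_rounds, cwnd=10, mss=1):
    """AIMD sketch: +1 MSS per RTT; halve cwnd on a loss round."""
    trace = []
    for rtt in range(rounds):
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd // 2)   # multiplicative decrease
        else:
            cwnd += mss                  # additive increase
        trace.append(cwnd)
    return trace

sawtooth = aimd(8, loss_rounds={5})
# cwnd climbs 11..15, halves to 7 at the loss, then climbs again
```

Plotting `sawtooth` over many loss events gives exactly the probing pattern sketched on the slide.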

TCP Congestion Control

- sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
- cwnd is dynamic, a function of perceived network congestion
- TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:

rate ≈ cwnd / RTT  bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet-ACKed ("in-flight"); last byte sent; window of cwnd bytes.]

TCP Congestion Control: TCP Fairness: Why is TCP Fair?

two competing sessions:
- additive increase gives slope of 1, as throughput increases
- multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, each bounded by R; alternating congestion avoidance (additive increase) and loss (decrease window by factor of 2) moves the operating point toward the equal bandwidth share line.]
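The convergence argument above can be checked numerically with a two-flow toy model. Assumptions: both flows see every loss (synchronized multiplicative decrease) and add one unit per RTT; the numbers are illustrative, not from the slides.

```python
def share_after(c1, c2, capacity, rounds):
    """Two AIMD flows on one bottleneck (sketch): each adds 1 per RTT,
    and both halve when their sum exceeds capacity."""
    for _ in range(rounds):
        c1, c2 = c1 + 1, c2 + 1          # additive increase: slope 1
        if c1 + c2 > capacity:
            c1, c2 = c1 / 2, c2 / 2      # multiplicative decrease
    return c1, c2

c1, c2 = share_after(1, 30, capacity=40, rounds=200)
# additive increase preserves the gap between the flows, while each
# halving halves it, so the two rates converge to an equal share
# even from a very unequal start
```

This mirrors the figure: the operating point walks diagonally up (slope 1), then jumps halfway toward the origin, homing in on the equal-share line.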

TCP Congestion Control: TCP Fairness

Fairness and UDP:
- multimedia apps often do not use TCP: do not want rate throttled by congestion control
- instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
- application can open multiple parallel connections between two hosts; web browsers do this
- e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

TCP Congestion Control: Slow Start

when connection begins, increase rate exponentially until first loss event:
- initially cwnd = 1 MSS
- double cwnd every RTT
- done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: Host A sends one segment, then two, then four, each round one RTT apart, with Host B ACKing each.]
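A short loop makes the per-ACK rule explicit: adding 1 MSS per ACK received yields per-RTT doubling. This is a sketch in MSS units, ignoring losses and ssthresh.

```python
def slow_start(rtts, mss=1, cwnd=1):
    """Slow-start sketch: cwnd += 1 MSS per ACK; with cwnd segments
    ACKed each RTT, the window doubles every RTT."""
    sizes = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss          # one ACK per segment sent this RTT
        cwnd += acks * mss          # +1 MSS per ACK => doubling
        sizes.append(cwnd)
    return sizes

doubling = slow_start(4)   # [1, 2, 4, 8, 16]
```

Four RTTs take the window from 1 to 16 segments: slow to start, exponentially fast to ramp.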

TCP Congestion Control: Detecting and Reacting to Loss

- loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
- loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
- variable ssthresh
- on loss event, ssthresh is set to 1/2 of cwnd just before loss event

[FSM: TCP congestion control (Reno), three states: slow start, congestion avoidance, fast recovery.
- start: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start.
- slow start, new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed; when cwnd > ssthresh, move to congestion avoidance.
- congestion avoidance, new ACK: cwnd = cwnd + MSS · (MSS / cwnd); dupACKcount = 0; transmit new segment(s) as allowed.
- slow start or congestion avoidance, duplicate ACK: dupACKcount++; when dupACKcount == 3: ssthresh = cwnd / 2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery.
- fast recovery, duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed.
- fast recovery, new ACK: cwnd = ssthresh; dupACKcount = 0; return to congestion avoidance.
- any state, timeout: ssthresh = cwnd / 2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start.]

TCP Congestion Control
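The FSM condenses into an event-driven sketch. The class is hypothetical and deliberately simplified: cwnd is tracked in MSS units, and fast recovery's per-duplicate-ACK window inflation is folded away so only the three loss/growth rules remain.

```python
class RenoSketch:
    """Condensed Reno sketch: slow start, congestion avoidance, and the
    two loss reactions (triple duplicate ACK vs. timeout)."""

    def __init__(self, mss=1, ssthresh=64):
        self.mss, self.cwnd, self.ssthresh = mss, mss, ssthresh

    def on_new_ack(self):
        if self.cwnd < self.ssthresh:                 # slow start
            self.cwnd += self.mss                     # exponential growth
        else:                                         # congestion avoidance
            self.cwnd += self.mss * self.mss / self.cwnd  # ~+1 MSS per RTT

    def on_triple_dup_ack(self):                      # Reno: halve, not reset
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.ssthresh + 3 * self.mss

    def on_timeout(self):                             # back to slow start
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
```

Ten new ACKs grow cwnd from 1 to 11 in slow start; a triple duplicate ACK then sets ssthresh to 5.5 and cwnd to 8.5, while a timeout would drop cwnd all the way back to 1 MSS.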

TCP Throughput

- avg. TCP throughput as a function of window size, RTT? (ignore slow start, assume always data to send)
- W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4) · W / RTT  bytes/sec

[Figure: sawtooth oscillating between W/2 and W.]
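As a quick numeric check of the formula (units assumed: W in bytes, RTT in seconds; the example values are illustrative):

```python
def avg_tcp_throughput(w_bytes, rtt_sec):
    """Average throughput of the sawtooth: the window oscillates between
    W/2 and W, so the average window is 3/4 W per RTT."""
    return 0.75 * w_bytes / rtt_sec

# loss at W = 80 KB with RTT = 100 ms:
rate = avg_tcp_throughput(80_000, 0.1)   # about 600,000 bytes/sec
```

The sawtooth's time average sits midway between W/2 and W, which is where the 3/4 factor comes from.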

TCP over "long, fat pipes"

- example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = 1.22 · MSS / (RTT · √L)

- to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10: a very small loss rate!
- new versions of TCP needed for high-speed
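Both numbers on this slide follow from the formula; the sketch below redoes the arithmetic with bits and seconds as the working units.

```python
MSS = 1500 * 8            # segment size in bits
RTT = 0.1                 # 100 ms, in seconds
TARGET = 10e9             # 10 Gbps, in bits/sec

# window needed: TARGET * RTT bits must be in flight
W = TARGET * RTT / MSS    # in-flight segments

# invert Mathis: throughput = 1.22 * MSS / (RTT * sqrt(L))  =>  solve for L
L = (1.22 * MSS / (RTT * TARGET)) ** 2

# W rounds to 83,333 segments; L comes out near 2e-10
```

A loss rate of roughly one segment in five billion is far below what real paths deliver, which is why high-speed TCP variants change the growth rule instead.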

Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing / Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
  - Congestion control
- Data Center TCP
  - Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
- Test on an Ethernet-based storage cluster
- Client performs synchronized reads
- Increase # of servers involved in transfer (SRU size is fixed)
- TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a client connected through a switch to four storage servers. A data block is striped across the servers in Server Request Units (SRUs); the client issues a synchronized read for SRUs 1–4 and, after all four responses arrive, sends the next batch of requests.]

Link idle time due to timeouts

[Figure: same synchronized read of SRUs 1–4, but one server's transfer suffers losses; the link sits idle until that server experiences a timeout.]

TCP Throughput Collapse: Incast

- [Nagle04] called this Incast
- Cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers increases.]

TCP data-driven loss recovery

[Timeline: sender transmits segments 1–5; segment 2 is lost. The receiver returns Ack 1 for each later segment; 3 duplicate ACKs for 1 mean packet 2 is probably lost, so the sender retransmits packet 2 immediately, and the receiver then returns Ack 5.]

In SANs: recovery in microseconds (µs) after loss.

TCP timeout-driven loss recovery

[Timeline: sender transmits segments 1–5; losses leave the receiver ACKing only segment 1. With too few duplicate ACKs to trigger fast retransmit, the sender must wait out the retransmission timeout (RTO) before resending.]

- Timeouts are expensive (msecs to recover after loss)

TCP: Loss recovery comparison

[Timelines, side by side: data-driven recovery: triple duplicate ACKs trigger an immediate retransmit of segment 2, answered by Ack 5; timeout-driven recovery: the sender idles for the full retransmission timeout (RTO) before resending.]

- Timeout-driven recovery is slow (ms)
- Data-driven recovery is super fast (µs) in SANs

TCP Throughput Collapse: Summary

- Synchronized reads and TCP timeouts cause TCP throughput collapse
- Previously tried options:
  - Increase buffer size (costly)
  - Reduce RTOmin (unsafe)
  - Use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective

- principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
- instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
- leaving the network "edge" (application, transport layers)
- into the network "core"

Before Next time

- Project Proposal
  - due in one week
  - meet with groups, TA, and professor
- Lab 1
  - single-threaded TCP proxy
  - due in one week, next Friday
- No required reading and review due
- But review chapter 4 from the book: Network Layer
  - We will also briefly discuss data center topologies
- Check website for updated schedule


event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q: when should the exponential increase switch to linear?

A: when cwnd gets to 1/2 of its value before timeout

Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before the loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

[FSM: TCP Reno congestion control]

slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
– new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
– duplicate ACK: dupACKcount++
– cwnd ≥ ssthresh: go to congestion avoidance
– timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start
– dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

congestion avoidance:
– new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
– duplicate ACK: dupACKcount++
– timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
– dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

fast recovery:
– duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
– timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
– new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance

TCP Congestion Control
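The Reno state machine above can be sketched as a small Python class (a simplified model: cwnd and ssthresh are in MSS units, ssthresh starts at 64; names are illustrative):

```python
class RenoSender:
    """Minimal sketch of the TCP Reno congestion-control FSM."""

    def __init__(self):
        self.state, self.cwnd, self.ssthresh, self.dup = "slow_start", 1.0, 64.0, 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += 1                       # +1 MSS per ACK (doubles per RTT)
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += 1.0 / self.cwnd         # ~ +1 MSS per RTT
        else:                                    # new ACK ends fast recovery
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
        self.dup = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1                       # inflate window per extra dup ACK
            return
        self.dup += 1
        if self.dup == 3:                        # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh, self.cwnd, self.dup, self.state = (
            self.cwnd / 2, 1.0, 0, "slow_start")
```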

• avg TCP throughput as function of window size, RTT (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs; the window oscillates between W/2 (just after a loss) and W, so avg window size (# in-flight bytes) is 3/4 W

avg TCP throughput = (3/4) · W / RTT bytes/sec

TCP Throughput
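The sawtooth average above is a one-liner to compute (a sketch; the window and RTT values are illustrative):

```python
def avg_tcp_throughput(w_bytes: float, rtt_s: float) -> float:
    """Average throughput of the AIMD sawtooth: the window oscillates
    between W/2 and W, averaging 3/4 W per RTT (slow start ignored)."""
    return 0.75 * w_bytes / rtt_s

# e.g. loss occurs at W = 120,000 bytes on a 500 ms RTT path
rate = avg_tcp_throughput(120000, 0.5)   # bytes/sec
```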

TCP over "long fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = (1.22 · MSS) / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP needed for high-speed
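The 2·10⁻¹⁰ figure follows from inverting the Mathis formula (a sketch; MSS = 1500 bytes and 10 Gbps = 1.25·10⁹ bytes/sec as in the example above):

```python
import math

def mathis_throughput(mss_bytes: float, rtt_s: float, loss: float) -> float:
    """Mathis et al. approximation: throughput = 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * mss_bytes / (rtt_s * math.sqrt(loss))

def required_loss_rate(mss_bytes: float, rtt_s: float, target: float) -> float:
    """Invert the formula: loss rate L needed to sustain `target` bytes/sec."""
    return (1.22 * mss_bytes / (rtt_s * target)) ** 2

loss = required_loss_rate(1500, 0.100, 1.25e9)   # roughly 2e-10
```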

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput Collapse: What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer; SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[diagram: client connected through a switch to four storage servers; a data block is striped across servers 1–4, each holding one Server Request Unit (SRU); in a synchronized read the client requests units 1–4 in parallel, and sends the next batch of requests only after all four responses arrive]

Link idle time due to timeouts

[diagram: same synchronized read, but server 4's response is lost; servers 1–3 complete, and the link is idle until server 4 experiences a timeout and retransmits its SRU]

TCP Throughput Collapse Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[figure: goodput vs number of servers, showing the collapse]

TCP data-driven loss recovery

[diagram: sender transmits segments 1–5; segment 2 is lost; receiver returns Ack 1 four times: 3 duplicate ACKs for 1 (packet 2 is probably lost), so the sender retransmits packet 2 immediately, and the receiver then sends Ack 5; in SANs, recovery in µsecs after loss]

TCP timeout-driven loss recovery

[diagram: sender transmits segments 1–5; only Ack 1 returns; the sender must wait out a full Retransmission Timeout (RTO) before retransmitting]

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

[side-by-side diagrams: data-driven recovery retransmits segment 2 as soon as three duplicate Ack 1s arrive and is then ACKed with Ack 5; timeout-driven recovery sits idle until the Retransmission Timeout (RTO) fires]

Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs

TCP Throughput Collapse Summary

• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options: increase buffer size (costly); reduce RTOmin (unsafe); use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control

instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Perspective

Before Next time
• Project Proposal: due in one week; meet with groups, TA, and professor
• Lab1: single-threaded TCP proxy; due in one week, next Friday
• No required reading and review due, but review chapter 4 from the book: Network Layer (we will also briefly discuss data center topologies)
• Check website for updated schedule


TCP service:
• reliable transport between sending and receiving process
• flow control: sender won't overwhelm receiver
• congestion control: throttle sender when network overloaded
• does not provide: timing, minimum throughput guarantee, security
• connection-oriented: setup required between client and server processes

UDP service:
• unreliable data transfer between sending and receiving process
• does not provide: reliability, flow control, congestion control, timing, throughput guarantee, security, or connection setup

Q: why bother? Why is there a UDP?

Transport Layer Services/Protocols

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)

demultiplexing at receiver: use header info to deliver received segments to correct socket (process ↔ socket)

[diagram: three hosts, each with application/transport/network/link/physical layers; processes P1 and P2 on the sending host, P3 and P4 on the receivers, each bound to a socket]

Transport LayerSockets MultiplexingDemultiplexing
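The demux rules differ by protocol: UDP delivers on destination (IP, port) alone, while TCP uses the full 4-tuple. A minimal sketch (dictionary keys stand in for kernel socket tables; names are illustrative):

```python
def udp_demux(sockets: dict, dst_ip: str, dst_port: int):
    """UDP: a socket is identified by destination (IP, port) only."""
    return sockets.get((dst_ip, dst_port))

def tcp_demux(sockets: dict, src_ip: str, src_port: int,
              dst_ip: str, dst_port: int):
    """TCP: a socket is identified by the full 4-tuple, so many
    connections can share one destination port."""
    return sockets.get((src_ip, src_port, dst_ip, dst_port))
```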

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

UDP segment format (32 bits wide):
• source port | dest port
• length (in bytes of UDP segment, including header) | checksum
• application data (payload)

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP Connectionless TransportUDP Segment Header
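The four 16-bit header fields above pack into exactly 8 bytes; a sketch using Python's standard `struct` module (the port numbers are illustrative):

```python
import struct

def udp_header(src_port: int, dst_port: int, payload_len: int,
               checksum: int = 0) -> bytes:
    """Pack the 8-byte UDP header: four 16-bit big-endian (network
    byte order) fields; the length field covers header + payload."""
    return struct.pack("!4H", src_port, dst_port, 8 + payload_len, checksum)

header = udp_header(5413, 53, 4)   # 4-byte payload -> length field = 12
```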

UDP Connectionless Transport

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected (but maybe errors nonetheless? more later …)

UDP Checksum
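The one's-complement sum above can be sketched in a few lines (a simplified model of the Internet checksum over raw bytes; real UDP also sums a pseudo-header, which is omitted here):

```python
def ones_complement_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words, complemented."""
    if len(data) % 2:                 # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # wrap carry back in
    return ~total & 0xFFFF

def verify(data: bytes, checksum: int) -> bool:
    """Receiver check: words plus checksum must sum to all ones."""
    if len(data) % 2:
        data += b"\x00"
    total = checksum
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF
```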

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

important in application, transport, and link layers; top-10 list of important networking topics!

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport


[diagram: rdt interfaces — send side: rdt_send() called from above (e.g., by app), passes data to deliver to receiver upper layer; udt_send() called by rdt to transfer packet over unreliable channel to receiver; receive side: rdt_rcv() called when packet arrives on rcv-side of channel; deliver_data() called by rdt to deliver data to upper layer]

Principles of Reliable Transport

• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)

TCP Reliable Transport

[TCP segment structure, 32 bits wide:
• source port | dest port
• sequence number (counting by bytes of data, not segments)
• acknowledgement number
• head len | not used | flags: URG (urgent data, generally not used), ACK (ACK # valid), PSH (push data now, generally not used), RST, SYN, FIN (connection estab: setup, teardown commands) | receive window (# bytes rcvr willing to accept)
• checksum (Internet checksum, as in UDP) | urg data pointer
• options (variable length)
• application data (variable length)]

TCP Reliable TransportTCP Segment Structure

sequence numbers: byte-stream "number" of first byte in segment's data

acknowledgements: seq # of next byte expected from other side; cumulative ACK

Q: how does receiver handle out-of-order segments? A: TCP spec doesn't say; up to implementor

[figure: outgoing segment from sender and incoming segment to sender, each showing source port, dest port, sequence number, acknowledgement number, rwnd, checksum, urg pointer; sender sequence number space split into: sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable, with window size N]

TCP Reliable TransportTCP Sequence numbers and Acks

[simple telnet scenario, Host A ↔ Host B: user types 'C', so A sends Seq=42, ACK=79, data='C'; B ACKs receipt of 'C' and echoes back 'C' (Seq=79, ACK=43, data='C'); A ACKs receipt of echoed 'C' (Seq=43, ACK=80)]

TCP Reliable TransportTCP Sequence numbers and Acks
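The echo exchange above can be reproduced over a real TCP connection on the loopback interface (a self-contained sketch using Python's standard sockets; the one-byte message mirrors the typed 'C'):

```python
import socket
import threading

def echo_once(msg: bytes) -> bytes:
    """Start a one-shot echo server on an ephemeral loopback port,
    send msg to it over TCP, and return what comes back."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))        # port 0 -> OS picks a free port
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        conn.sendall(conn.recv(len(msg)))   # echo the bytes back
        conn.close()

    t = threading.Thread(target=serve)
    t.start()
    cli = socket.create_connection(("127.0.0.1", port))
    cli.sendall(msg)
    echoed = cli.recv(len(msg))
    cli.close()
    t.join()
    srv.close()
    return echoed
```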

• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)

TCP Reliable Transport

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters

[diagram: client and server each end in connection state ESTAB, with connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server/client; client side: Socket clientSocket = new Socket("hostname", "port number"); server side: Socket connectionSocket = welcomeSocket.accept()]

Connection Management TCP 3-way handshake

TCP Reliable Transport

[3-way handshake message diagram: client (LISTEN) chooses init seq num x, sends TCP SYN msg (SYNbit=1, Seq=x), enters SYNSENT; server (LISTEN) receives the SYN, chooses init seq num y, sends SYNACK msg acking the SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1), enters SYN RCVD; client: received SYNACK(x) indicates server is live, so send ACK for SYNACK (ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data) and enter ESTAB; server: received ACK(y) indicates client is live, enter ESTAB]

TCP Reliable TransportConnection Management TCP 3-way handshake

[handshake state diagram: client: CLOSED → (clientSocket = new Socket(...): send SYN(seq=x)) → SYNSENT → (receive SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)) → ESTAB; server: CLOSED → (connectionSocket = welcomeSocket.accept()) → LISTEN → (receive SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client) → SYN RCVD → (receive ACK(ACKnum=y+1)) → ESTAB]

TCP Reliable TransportConnection Management TCP 3-way handshake
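The sequence/ack arithmetic of the handshake above can be written out directly (a sketch; dictionaries stand in for segments, and x, y are the two initial sequence numbers):

```python
def three_way_handshake(x: int, y: int):
    """Seq/ack numbers carried by the three handshake segments,
    given client ISN x and server ISN y."""
    syn    = {"flags": "SYN",     "seq": x}
    synack = {"flags": "SYN+ACK", "seq": y,     "ack": x + 1}
    ack    = {"flags": "ACK",     "seq": x + 1, "ack": y + 1}
    return [syn, synack, ack]
```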

client, server each close their side of connection: send TCP segment with FIN bit = 1; respond to received FIN with ACK (on receiving FIN, ACK can be combined with own FIN); simultaneous FIN exchanges can be handled

TCP Reliable TransportConnection Management Closing connection

[connection teardown diagram: client (ESTAB): clientSocket.close(), send FINbit=1, seq=x (can no longer send, but can receive data) → FIN_WAIT_1 → receive ACKbit=1, ACKnum=x+1 → FIN_WAIT_2 (wait for server close) → receive FINbit=1, seq=y, send ACKbit=1, ACKnum=y+1 → TIMED_WAIT (timed wait for 2 × max segment lifetime) → CLOSED; server (ESTAB): receive FIN, send ACK (can still send data) → CLOSE_WAIT → send FINbit=1, seq=y (can no longer send data) → LAST_ACK → receive ACK → CLOSED]

TCP Reliable TransportConnection Management Closing connection

• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)

TCP Reliable Transport

data rcvd from app: create segment with seq #; seq # is byte-stream number of first data byte in segment; start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval

timeout: retransmit segment that caused timeout; restart timer

ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport

[lost ACK scenario, Host A ↔ Host B: A sends Seq=92, 8 bytes of data; B's ACK=100 is lost; A times out and retransmits Seq=92, 8 bytes; B ACKs 100 again; SendBase advances 92 → 100]

[premature timeout: A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); A's timer fires before ACK=100 arrives, so A retransmits Seq=92, 8 bytes; B replies with cumulative ACK=120; SendBase advances 92 → 100 → 120]

TCP Reliable TransportTCP Retransmission Scenerios

[cumulative ACK scenario, Host A ↔ Host B: A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); ACK=100 is lost, but cumulative ACK=120 arrives before the timeout, covering both segments; A proceeds with Seq=120, 15 bytes, with no retransmission]

TCP Reliable TransportTCP Retransmission Scenerios

TCP ACK generation [RFC 1122, RFC 2581] — event at receiver → TCP receiver action:

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

time-out period often relatively long: long delay before resending lost packet

detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #; likely that unacked segment was lost, so don't wait for timeout

TCP Reliable TransportTCP Fast Retransmit

[figure: fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); Seq=100 is lost; four ACK=100s come back; A resends Seq=100, 20 bytes before its timeout expires]

TCP Reliable TransportTCP Fast Retransmit
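The triple-duplicate trigger is easy to sketch over a stream of ACK numbers (three *duplicates* means four ACKs of the same value in total, as in the figure above):

```python
def fast_retransmit_trigger(acks):
    """Return the ACK number that triggers fast retransmit: the first
    value repeated three more times after its original ACK, else None."""
    last, dups = None, 0
    for a in acks:
        if a == last:
            dups += 1
            if dups == 3:       # third duplicate -> retransmit now
                return a
        else:
            last, dups = a, 0   # a new cumulative ACK resets the count
    return None
```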

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average; influence of past sample decreases exponentially fast
• typical value: α = 0.125

[figure: RTT (milliseconds) vs time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr: noisy SampleRTT (roughly 100–350 ms) around a smoother EstimatedRTT]

TCP Reliable Transport: TCP Roundtrip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin" (large variation in EstimatedRTT → larger safety margin)
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|   (typically β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT   (estimated RTT + "safety margin")

TCP Reliable TransportTCP Roundtrip time and timeouts
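One update step of the estimator above, as a sketch (whether DevRTT uses the old or the new EstimatedRTT is an implementation choice; the new one is used here):

```python
def update_rtt_estimate(est: float, dev: float, sample: float,
                        alpha: float = 0.125, beta: float = 0.25):
    """EWMA update for EstimatedRTT and DevRTT, returning the
    resulting TimeoutInterval = EstimatedRTT + 4 * DevRTT."""
    est = (1 - alpha) * est + alpha * sample
    dev = (1 - beta) * dev + beta * abs(sample - est)
    return est, dev, est + 4 * dev

# e.g. current estimate 100 ms, deviation 10 ms, new sample 140 ms
est, dev, timeout = update_rtt_estimate(100.0, 10.0, 140.0)
```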

• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)

TCP Reliable Transport

[diagram: receiver protocol stack: the OS runs TCP code and IP code; the application process reads from TCP socket receiver buffers; segments arrive from the sender]

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP Reliable TransportFlow Control

[receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in, data flows out to the application process]

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments; RcvBuffer size set via socket options (typical default is 4096 bytes); many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable TransportFlow Control
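The sender-side rule above is a small amount of arithmetic (a sketch; byte counts are illustrative):

```python
def bytes_sendable(last_byte_sent: int, last_byte_acked: int,
                   rwnd: int) -> int:
    """Flow-control limit: keep unACKed ("in-flight") data within the
    receiver's advertised window rwnd; returns how much more may be sent."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# 400 bytes in flight against a 500-byte advertised window
room = bytes_sendable(1000, 600, 500)
```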

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

congestion: informally, "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations: lost packets (buffer overflow at routers); long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:
• end-end congestion control: no explicit feedback from network; congestion inferred from end-system observed loss, delay; approach taken by TCP
• network-assisted congestion control: routers provide feedback to end systems, either a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM) or an explicit rate for the sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[diagram: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

TCP Congestion ControlTCP Fairness

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[figure: TCP sender congestion window size (cwnd) over time: the AIMD "sawtooth" behavior, probing for bandwidth; additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control/TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

process

socket

demultiplexing at receiver: use header info to deliver received segments to correct socket
multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)

[Figure: three hosts' protocol stacks (application, transport, network, link, physical), with application processes P1–P4 sending and receiving through sockets]

Transport Layer: Sockets, Multiplexing/Demultiplexing

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

UDP segment format (32 bits wide):
  source port | dest port
  length | checksum
  application data (payload)
length: length, in bytes, of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP Connectionless Transport: UDP Segment Header

UDP Connectionless Transport

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later…

UDP Checksum
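The one's-complement arithmetic above is easy to get wrong: any carry out of the 16-bit sum must wrap back into the low bits. A minimal Python sketch of the RFC 768/1071-style computation (not from the slides, purely illustrative):

```python
def udp_checksum(data: bytes) -> int:
    """One's complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # big-endian 16-bit word
        total = (total & 0xFFFF) + (total >> 16)   # wrap carry back in
    return ~total & 0xFFFF

# receiver check: checksumming segment + its own checksum yields 0
seg = b"\x12\x34\x56\x78"
csum = udp_checksum(seg)
ok = udp_checksum(seg + csum.to_bytes(2, "big")) == 0
```

Because the field stores the complement of the sum, re-summing the segment together with its checksum gives all ones, which is why the receiver's verification reduces to an equality check.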

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

important in application, transport, link layers: top-10 list of important networking topics!
characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

send side / receive side interface:
rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
rdt_rcv(): called when packet arrives on rcv-side of channel
deliver_data(): called by rdt to deliver data to upper layer

Principles of Reliable Transport

TCP: Transmission Control Protocol — RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport

TCP segment structure (32 bits wide):
  source port | dest port
  sequence number — counting by bytes of data (not segments!)
  acknowledgement number
  head len | not used | U A P R S F | receive window — # bytes rcvr willing to accept
  checksum (Internet checksum, as in UDP) | Urg data pointer
  options (variable length)
  application data (variable length)
flag bits: URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)

TCP Reliable Transport: TCP Segment Structure

sequence numbers: byte stream "number" of first byte in segment's data
acknowledgements: seq # of next byte expected from other side; cumulative ACK
Q: how receiver handles out-of-order segments? A: TCP spec doesn't say — up to implementor

[Figure: outgoing segment from sender carries the sequence number; the incoming segment's acknowledgement number ACKs the sender's data. Sender sequence number space, window size N: sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]

TCP Reliable Transport: TCP Sequence numbers and Acks

simple telnet scenario (Host A ↔ Host B):
1. User types 'C': A sends Seq=42, ACK=79, data = 'C'
2. B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. A ACKs receipt of echoed 'C': Seq=43, ACK=80

TCP Reliable Transport: TCP Sequence numbers and Acks
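The arithmetic in the scenario is simply "ACK = seq # of the next byte expected". A tiny sketch (helper name my own, not from the slides):

```python
def cumulative_ack(seq: int, data: bytes) -> int:
    """ACK number the receiver returns: seq # of the next byte expected."""
    return seq + len(data)

# the telnet echo scenario above:
ack_for_typed_c = cumulative_ack(42, b"C")    # host B now expects byte 43
ack_for_echoed_c = cumulative_ack(79, b"C")   # host A now expects byte 80
```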


TCP Reliable Transport

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();
both sides then hold connection state: ESTAB, and connection variables: seq # (client-to-server, server-to-client), rcvBuffer size at server, client

TCP Reliable Transport — Connection Management: TCP 3-way handshake

client state: LISTEN → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
1. client: choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1 — received SYNACK(x) indicates server is live
3. client: send ACK for SYNACK (this segment may contain client-to-server data): ACKbit=1, ACKnum=y+1 — received ACK(y) indicates client is live

TCP Reliable Transport — Connection Management: TCP 3-way handshake

handshake FSM:
client: closed → [Socket clientSocket = new Socket("hostname", "port number"): send SYN(seq=x)] → SYN sent → [rcv SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)] → ESTAB
server: closed → [Socket connectionSocket = welcomeSocket.accept()] → listen → [rcv SYN(x): create new socket for communication back to client, send SYNACK(seq=y, ACKnum=x+1)] → SYN rcvd → [rcv ACK(ACKnum=y+1)] → ESTAB

TCP Reliable Transport — Connection Management: TCP 3-way handshake
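In application code the whole SYN/SYNACK/ACK exchange hides behind connect() and accept(). A self-contained loopback sketch (in Python rather than the slides' Java; port 0 lets the OS pick a free port):

```python
import socket
import threading

welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))        # port 0: OS picks a free port
welcome.listen(1)
port = welcome.getsockname()[1]

def serve():
    conn, _ = welcome.accept()        # returns once the 3-way handshake completes
    conn.sendall(b"hello")
    conn.close()

t = threading.Thread(target=serve)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))   # kernel sends SYN, awaits SYNACK, sends ACK
data = client.recv(5)
client.close()
t.join()
welcome.close()
```

By the time connect() returns on the client and accept() returns on the server, both sides are in ESTAB; the application never sees the SYN segments themselves.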

client, server each close their side of connection: send TCP segment with FIN bit = 1
respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
simultaneous FIN exchanges can be handled

TCP Reliable Transport — Connection Management: Closing connection

client state: ESTAB → FIN_WAIT_1 → FIN_WAIT_2 → TIMED_WAIT → CLOSED; server state: ESTAB → CLOSE_WAIT → LAST_ACK → CLOSED
1. clientSocket.close(): client sends FINbit=1, seq=x (can no longer send, but can receive data)
2. server ACKs: ACKbit=1, ACKnum=x+1 (server can still send data; client waits for server close)
3. server closes: FINbit=1, seq=y (can no longer send data)
4. client ACKs: ACKbit=1, ACKnum=y+1; client does timed wait for 2 × max segment lifetime, then CLOSED

TCP Reliable Transport — Connection Management: Closing connection


TCP Reliable Transport

TCP sender events:
data rcvd from app: create segment with seq # (seq # is byte-stream number of first data byte in segment); start timer if not already running — think of timer as for oldest unacked segment; expiration interval: TimeOutInterval
timeout: retransmit segment that caused timeout; restart timer
ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport

lost ACK scenario (Host A → Host B): A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X); A times out and retransmits Seq=92, 8 bytes of data; B re-sends ACK=100; SendBase moves 92 → 100
premature timeout: A sends Seq=92, 8 bytes, then Seq=100, 20 bytes; ACK=100 arrives late and A times out, retransmitting Seq=92, 8 bytes; cumulative ACK=120 then acknowledges everything; SendBase moves 100 → 120

TCP Reliable Transport: TCP Retransmission Scenarios

cumulative ACK scenario: A sends Seq=92, 8 bytes, then Seq=100, 20 bytes; ACK=100 is lost (X) but ACK=120 arrives before timeout — the cumulative ACK covers both segments, so A continues with Seq=120, 15 bytes of data

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, RFC 2581]

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

time-out period often relatively long: long delay before resending lost packet
detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq # — likely that unacked segment lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit

fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92, 8 bytes, then Seq=100, 20 bytes; the Seq=100 segment is lost (X); Host B answers each later arrival with ACK=100; on the third duplicate ACK, A retransmits Seq=100, 20 bytes of data — before its timeout expires

TCP Reliable Transport: TCP Fast Retransmit
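The sender-side rule can be sketched as a duplicate-ACK counter (illustrative helper, names my own):

```python
def fast_retransmit_seq(acks, threshold=3):
    """Scan a stream of ACK numbers; return the seq # to retransmit
    as soon as `threshold` duplicate ACKs for it are seen, else None."""
    last_ack, dups = None, 0
    for ack in acks:
        if ack == last_ack:
            dups += 1
            if dups == threshold:   # "triple duplicate ACKs"
                return ack          # resend segment starting at this byte
        else:
            last_ack, dups = ack, 0
    return None

# scenario above: four ACK=100 in a row -> retransmit Seq=100 early
```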

Q: how to set TCP timeout value?
• longer than RTT — but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Roundtrip time and timeouts

[Figure: RTT (milliseconds), gaia.cs.umass.edu to fantasia.eurecom.fr, vs. time (seconds): SampleRTT fluctuates between roughly 100 and 350 ms; EstimatedRTT tracks it smoothly]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Roundtrip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin" — large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|  (typically, β = 0.25)
• TimeoutInterval = EstimatedRTT + 4 · DevRTT  (estimated RTT + "safety margin")

TCP Reliable Transport: TCP Roundtrip time and timeouts
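The two EWMAs combine into the timer like this — a sketch with made-up SampleRTT values; following RFC 6298, DevRTT is updated using the old EstimatedRTT before EstimatedRTT itself moves:

```python
ALPHA, BETA = 0.125, 0.25         # the typical values from the slides

def update_timeout(est_rtt, dev_rtt, sample_rtt):
    """Fold one SampleRTT into the estimators; return the new triple."""
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt
    timeout = est_rtt + 4 * dev_rtt   # EstimatedRTT + "safety margin"
    return est_rtt, dev_rtt, timeout

# one step with hypothetical values (ms): EstimatedRTT=100, DevRTT=0, SampleRTT=120
est, dev, rto = update_timeout(100.0, 0.0, 120.0)
```

One surprising sample (here, 120 ms against an estimate of 100 ms) inflates the timeout well above the smoothed RTT, which is exactly the "safety margin" behavior the slide describes.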


TCP Reliable Transport

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
receiver protocol stack: application process reads from TCP socket receiver buffers (TCP code, IP code, down to the OS); application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

TCP Reliable Transport: Flow Control

receiver-side buffering: RcvBuffer holds buffered data; rwnd = free buffer space; TCP segment payloads flow in, the application process drains the buffer
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
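The sender-side constraint is just window arithmetic (illustrative helper, not a real TCP API):

```python
def usable_window(last_byte_sent, last_byte_acked, rwnd):
    """Bytes the sender may still put in flight without overflowing
    the receiver's advertised window rwnd."""
    in_flight = last_byte_sent - last_byte_acked   # unacked ("in-flight") data
    return max(0, rwnd - in_flight)

# e.g. 3000 bytes sent, 1000 ACKed, receiver advertises the default 4096:
room = usable_window(3000, 1000, 4096)
```

When in-flight data reaches rwnd the usable window is zero and the sender must stall, which is how the receive buffer is guaranteed never to overflow.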

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:
end-end congestion control: no explicit feedback from network; congestion inferred from end-system observed loss, delay; approach taken by TCP
network-assisted congestion control: routers provide feedback to end systems — single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM) or explicit rate for sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connection 1 and TCP connection 2 through a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness

AIMD: additive increase, multiplicative decrease
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) vs. time — AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control — TCP Fairness: Why is TCP Fair? (AIMD)

sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
cwnd: dynamic, a function of perceived network congestion (sender sequence number space: last byte ACKed, sent not-yet ACKed ("in-flight"), last byte sent)
TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:
  rate ≈ cwnd / RTT  bytes/sec

TCP Congestion Control

why is TCP fair? two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, each axis up to R; alternating congestion-avoidance additive increases and factor-of-2 loss decreases converge toward the equal bandwidth share line]

TCP Congestion Control — TCP Fairness: Why is TCP Fair?

Fairness and UDP: multimedia apps often do not use TCP — do not want rate throttled by congestion control; instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections: application can open multiple parallel connections between two hosts; web browsers do this. e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS; double cwnd every RTT; done by incrementing cwnd for every ACK received
• summary: initial rate is slow, but ramps up exponentially fast
[Figure: Host A sends one segment, then two, then four, one RTT apart]

TCP Congestion Control: Slow Start

TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP congestion control FSM (slow start / congestion avoidance / fast recovery):

slow start (initial: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• cwnd > ssthresh: → congestion avoidance
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start

TCP Congestion Control
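The per-ACK window updates in the FSM can be sketched as follows (MSS normalized to 1.0; helper names my own):

```python
MSS = 1.0

def on_new_ack(cwnd, ssthresh):
    """Slow start below ssthresh (+1 MSS per ACK, i.e. doubling per RTT);
    congestion avoidance otherwise (+MSS*MSS/cwnd per ACK, ~+1 MSS per RTT)."""
    if cwnd < ssthresh:
        return cwnd + MSS
    return cwnd + MSS * MSS / cwnd

def on_triple_dup_ack(cwnd):
    """Reno fast retransmit: ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS."""
    ssthresh = cwnd / 2
    return ssthresh, ssthresh + 3 * MSS

def on_timeout(cwnd):
    """ssthresh = cwnd/2, cwnd back to 1 MSS (re-enter slow start)."""
    return cwnd / 2, MSS
```

Dividing MSS² by cwnd on each new ACK is what turns per-ACK growth into roughly one MSS per round trip, since about cwnd/MSS ACKs arrive per RTT.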

• avg TCP thruput as function of window size, RTT (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs; window oscillates between W/2 and W
  – avg window size (# in-flight bytes) is 3/4 W
  – avg thruput is 3/4 W per RTT: avg TCP thruput = (3/4) · W / RTT  bytes/sec

TCP Throughput

TCP over "long, fat pipes"
• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ — a very small loss rate!
• new versions of TCP for high-speed
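Plugging the slide's numbers into that formula confirms the claim (a quick check; units noted in comments):

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_prob):
    """Approximate TCP throughput in bits/sec:
    1.22 * MSS / (RTT * sqrt(L)), from [Mathis 1997]."""
    return 1.22 * mss_bytes * 8 / (rtt_seconds * sqrt(loss_prob))

# 1500-byte segments, 100 ms RTT, loss rate L = 2e-10 -> roughly 10 Gbps
gbps = mathis_throughput_bps(1500, 0.100, 2e-10) / 1e9
```

Reading the formula the other way: because throughput scales as 1/√L, sustaining 10 Gbps at this MSS and RTT tolerates only about one loss per five billion segments.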

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems: a client connects through a switch to storage servers 1–4; in a synchronized read, the client requests a data block striped across the servers as Server Request Units (SRUs) 1, 2, 3, 4; once all four responses arrive, the client sends the next batch of requests

Link idle time due to timeouts: in the same synchronized read, if server 4's SRU response is lost, the link is idle until server 4 experiences a timeout

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the # of servers involved in the transfer grows]

TCP data-driven loss recovery: sender transmits Seq # 1–5; packet 2 is lost; receiver ACKs 1 for each later arrival — 3 duplicate ACKs for 1 (packet 2 is probably lost) — so the sender retransmits packet 2 immediately and the receiver then ACKs 5. In SANs: recovery in µsecs after loss.

TCP timeout-driven loss recovery: sender transmits Seq # 1–5, but only packet 1 gets through (a single Ack 1); the sender must wait out the Retransmission Timeout (RTO) before resending. Timeouts are expensive (msecs to recover after loss).

TCP loss recovery comparison: data-driven recovery is super fast (µs) in SANs; timeout-driven recovery is slow (ms)

TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective
principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time
• Project Proposal: due in one week; meet with groups, TA, and professor
• Lab1: single threaded TCP proxy; due in one week, next Friday
• No required reading and review due, but review chapter 4 from the book (Network Layer); we will also briefly discuss data center topologies
• Check website for updated schedule

Page 8: Transport Layer and Data Center TCP

process

socket

use header info to deliverreceived segments to correct socket

demultiplexing at receiverhandle data from multiplesockets add transport header (later used for demultiplexing)

multiplexing at sender

transport

application

physical

link

network

P2P1

transport

application

physical

link

network

P4

transport

application

physical

link

network

P3

Transport LayerSockets MultiplexingDemultiplexing

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

source port dest port

32 bits

applicationdata (payload)

UDP segment format

length checksum

length in bytes of UDP segment

including header

no connection establishment (which can add delay)

simple no connection state at sender receiver

small header size no congestion control UDP

can blast away as fast as desired

why is there a UDP

UDP Connectionless TransportUDP Segment Header

UDP Connectionless Transport

senderbull treat segment contents

including header fields as sequence of 16-bit integers

bull checksum addition (onersquos complement sum) of segment contents

bull sender puts checksum value into UDP checksum field

receiverbull compute checksum of received

segmentbull check if computed checksum

equals checksum field value

ndash NO - error detectedndash YES - no error detected

But maybe errors nonetheless More later hellip

Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment

UDP Checksum

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

important in application transport link layers top-10 list of important networking topics

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

sendside

receiveside

rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer

udt_send() called by rdtto transfer packet over unreliable channel to receiver

rdt_rcv() called when packet arrives on rcv-side of channel

deliver_data() called by rdt to deliver data to upper

Principles of Reliable Transport

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing

to establish connection)bull agree on connection parameters

connection state ESTABconnection variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

connection state ESTABconnection Variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

Socket clientSocket = newSocket(hostnameport

number)

Socket connectionSocket = welcomeSocketaccept()

Connection Management TCP 3-way handshake

TCP Reliable Transport

client state: LISTEN → SYNSENT → ESTAB; server state: LISTEN → SYN RCVD → ESTAB
• client: choose init seq num x; send TCP SYN msg (SYNbit=1, Seq=x); state SYNSENT
• server: choose init seq num y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); state SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; state ESTAB
• server: received ACK(y) indicates client is live; state ESTAB

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

FSM view:
client: CLOSED → SYNSENT (on Socket clientSocket = new Socket(hostname, port): send SYN(seq=x)) → ESTAB (on rcv SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1))
server: CLOSED → LISTEN (on Socket connectionSocket = welcomeSocket.accept()) → SYN RCVD (on rcv SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client) → ESTAB (on rcv ACK(ACKnum=y+1))

TCP Reliable Transport: Connection Management (TCP 3-way handshake)
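The handshake's sequence-number arithmetic can be sketched in Python (a simplified sketch, assuming no loss or retransmission; the initial sequence numbers 100 and 300 are illustrative, not from the slides):

```python
# Sketch of 3-way handshake numbering: SYN carries seq=x; SYNACK
# carries seq=y, ack=x+1 (a SYN consumes one sequence number);
# the final ACK carries ack=y+1.

def synack_for(client_isn, server_isn):
    # server's reply to SYN(seq=x): ack the SYN, offer its own ISN
    return {"seq": server_isn, "ack": client_isn + 1}

def ack_for(synack):
    # client's reply to SYNACK(seq=y, ack=x+1): ack the server's SYN
    return {"ack": synack["seq"] + 1}

syn = {"seq": 100}                    # client chooses init seq num x = 100
synack = synack_for(syn["seq"], 300)  # server chooses init seq num y = 300
ack = ack_for(synack)                 # after this, both sides are ESTAB
```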

TCP Reliable Transport: Connection Management (closing a connection)

• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

closing sequence (client and server both start in ESTAB):
• client calls clientSocket.close(): sends FIN (FINbit=1, seq=x) → FIN_WAIT_1 (can no longer send, but can receive data)
• server ACKs (ACKbit=1, ACKnum=x+1) → CLOSE_WAIT (can still send data); client → FIN_WAIT_2 (wait for server close)
• server sends FIN (FINbit=1, seq=y) → LAST_ACK (can no longer send data)
• client ACKs (ACKbit=1, ACKnum=y+1) → TIMED_WAIT (timed wait for 2 × max segment lifetime) → CLOSED
• server receives the ACK → CLOSED

TCP Reliable Transport: Connection Management (closing a connection)


TCP sender events:
• data rcvd from app: create segment with seq #; seq # is byte-stream number of first data byte in segment; start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
• timeout: retransmit segment that caused timeout; restart timer
• ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport

TCP Reliable Transport: TCP Retransmission Scenarios

lost ACK scenario (Host A → Host B): A sends Seq=92, 8 bytes of data; B's ACK=100 is lost; A times out and retransmits Seq=92, 8 bytes; B re-sends ACK=100 (SendBase moves 92 → 100).

premature timeout: A sends Seq=92, 8 bytes, then Seq=100, 20 bytes; A's timer expires before ACK=100 arrives, so A retransmits Seq=92, 8 bytes; B replies with cumulative ACK=120 (SendBase moves 92 → 100 → 120).

cumulative ACK scenario: A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost, but cumulative ACK=120 arrives before the timeout, so A proceeds with Seq=120, 15 bytes (no retransmission).

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, RFC 2581]

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
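The cumulative-ACK behavior in the table can be sketched in Python (a sketch, assuming byte granularity and no delayed-ACK timer; segment sizes are illustrative):

```python
# Sketch of how a receiver's cumulative ACKs produce duplicate ACKs
# for out-of-order data, and a jump when a gap is filled.

class Receiver:
    def __init__(self):
        self.expected = 0        # next byte expected (the ACK value)
        self.out_of_order = {}   # seq -> length, buffered segments

    def on_segment(self, seq, length):
        if seq == self.expected:            # in-order: advance, drain buffer
            self.expected += length
            while self.expected in self.out_of_order:
                self.expected += self.out_of_order.pop(self.expected)
        elif seq > self.expected:           # gap detected: buffer it
            self.out_of_order[seq] = length
        return self.expected                # cumulative ACK

rcv = Receiver()
# 8-byte segments at offsets 0, 16, 24 arrive, then the missing one at 8:
acks = [rcv.on_segment(s, 8) for s in (0, 16, 24, 8)]
# the two ACK=8 repeats are duplicate ACKs; filling the gap ACKs everything
```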

TCP Reliable Transport: TCP Fast Retransmit
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
• TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #: likely that the unacked segment was lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK (Host A → Host B): A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the Seq=100 segment is lost; B sends ACK=100 four times; on the third duplicate, A retransmits Seq=100, 20 bytes before its timeout expires.

TCP Reliable Transport: TCP Fast Retransmit
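The sender-side trigger can be sketched in Python (a sketch; the ACK stream mirrors the scenario above, and the function name is illustrative):

```python
# Sketch of fast retransmit: the third duplicate ACK (i.e., the fourth
# ACK carrying the same value) triggers immediate retransmission of the
# smallest unACKed segment, without waiting for the timeout.

def fast_retransmit_point(acks):
    """Index in `acks` at which retransmission fires, or None."""
    last, dups = None, 0
    for i, ack in enumerate(acks):
        if ack == last:
            dups += 1
            if dups == 3:          # triple duplicate ACK
                return i
        else:
            last, dups = ack, 0
    return None

# Host B keeps ACKing 100 because the Seq=100 segment was lost:
fires_at = fast_retransmit_point([100, 100, 100, 100])
```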

TCP Reliable Transport: TCP Round-trip Time and Timeouts

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

[Figure: RTT in milliseconds (100–350) vs. time in seconds, gaia.cs.umass.edu to fantasia.eurecom.fr, plotting SampleRTT and EstimatedRTT]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average; influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Round-trip Time and Timeouts

• timeout interval: EstimatedRTT plus "safety margin": large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT: DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT| (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")

TCP Reliable Transport: TCP Round-trip Time and Timeouts
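The two estimators and the timeout rule can be sketched in Python (a sketch with α = 0.125 and β = 0.25 as above; the starting state and samples, in milliseconds, are assumptions for illustration):

```python
# Sketch of the EWMA RTT estimator and the TimeoutInterval rule.
# DevRTT is updated with the OLD EstimatedRTT, then EstimatedRTT moves.

def update_rtt(estimated, dev, sample, alpha=0.125, beta=0.25):
    dev = (1 - beta) * dev + beta * abs(sample - estimated)
    estimated = (1 - alpha) * estimated + alpha * sample
    return estimated, dev

def timeout_interval(estimated, dev):
    return estimated + 4 * dev      # estimate plus safety margin

est, dev = 100.0, 0.0               # assumed starting state, ms
for sample in (100, 100, 140):      # one late sample widens the margin
    est, dev = update_rtt(est, dev, sample)
```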


flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack: application process reads from TCP socket receiver buffers; TCP code and IP code live in the OS; segments arrive from the sender. The application may remove data from TCP socket buffers slower than the TCP receiver is delivering (sender is sending).]

TCP Reliable Transport: Flow Control

receiver-side buffering: RcvBuffer holds buffered data; rwnd is the free buffer space; TCP segment payloads are delivered to the application process
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments: RcvBuffer size set via socket options (typical default is 4096 bytes); many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
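The sender-side rule can be sketched in Python (a sketch; byte counts are illustrative, with the 4096-byte default RcvBuffer from above):

```python
# Sketch of flow control at the sender: unACKed ("in-flight") data
# never exceeds the last advertised rwnd, so RcvBuffer cannot overflow.

def usable_window(last_byte_sent, last_byte_acked, rwnd):
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)   # bytes the sender may still send

# receiver advertised rwnd = 4096; 3000 bytes are currently in flight:
can_send = usable_window(last_byte_sent=5000, last_byte_acked=2000,
                         rwnd=4096)
```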

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Congestion Control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations: lost packets (buffer overflow at routers); long delays (queueing in router buffers)

two broad approaches towards congestion control:
• end-end congestion control: no explicit feedback from network; congestion inferred from end-system observed loss, delay; approach taken by TCP
• network-assisted congestion control: routers provide feedback to end systems: single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM), or explicit rate for sender to send at

Principles of Congestion Control

TCP Congestion Control: TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R]

TCP Congestion Control: Why is TCP fair? AIMD: additive increase, multiplicative decrease

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs:
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) over time: AIMD "saw tooth" behavior, probing for bandwidth; additively increase window size until loss occurs, then cut window in half]
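The saw tooth can be sketched in Python (a sketch; the loss times are hand-picked assumptions, not a model of a real path, and cwnd is in MSS units):

```python
# Sketch of AIMD: +1 MSS per RTT (additive increase), halve on loss
# (multiplicative decrease).

def aimd(cwnd_mss, rtts, loss_at):
    trace = []
    for t in range(rtts):
        if t in loss_at:
            cwnd_mss = max(1, cwnd_mss // 2)   # multiplicative decrease
        else:
            cwnd_mss += 1                      # additive increase
        trace.append(cwnd_mss)
    return trace

# grow 10 -> 13, lose a packet at RTT 3, halve to 6, resume probing:
trace = aimd(cwnd_mss=10, rtts=6, loss_at={3})
```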

TCP Congestion Control

sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
• sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; at most cwnd bytes in flight

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:

rate ≈ cwnd / RTT bytes/sec

TCP Congestion Control: TCP Fairness. Why is TCP fair?

two competing sessions: additive increase gives slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes up to R; the trajectory alternates congestion avoidance (additive increase) with loss (decrease window by factor of 2), converging toward the equal bandwidth share line]
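The convergence argument can be sketched in Python (a sketch under strong assumptions: synchronized losses at the bottleneck and unit additive steps; starting rates are illustrative):

```python
# Sketch of why AIMD equalizes: both flows add 1 per RTT and both halve
# on a shared loss, so the gap between them halves at every loss event.

def aimd_pair(x1, x2, R, rounds):
    for _ in range(rounds):
        while x1 + x2 < R:           # additive increase: slope 1
            x1, x2 = x1 + 1, x2 + 1
        x1, x2 = x1 / 2, x2 / 2      # synchronized multiplicative decrease
    return x1, x2

# start far from fair (1 vs 60) on a bottleneck of capacity 100:
x1, x2 = aimd_pair(1, 60, R=100, rounds=20)
# after 20 loss events the two rates are essentially equal
```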

TCP Congestion Control: TCP Fairness

Fairness and UDP: multimedia apps often do not use TCP (do not want rate throttled by congestion control); instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections: application can open multiple parallel connections between two hosts; web browsers do this. e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2
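The R/10 vs. R/2 arithmetic can be sketched in Python (a sketch; it assumes the idealized per-connection fairness stated above):

```python
# Sketch: with per-connection fairness, an app's share of the link is
# its connection count over the total connection count.

def app_share(app_conns, other_conns, R=1.0):
    return R * app_conns / (app_conns + other_conns)

one = app_share(1, 9)    # 1 new connection beside 9 existing -> R/10
many = app_share(11, 9)  # 11 new connections -> 11/20 of R, about R/2
```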

TCP Congestion Control: Slow Start

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT: done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: Host A sends one segment, then two, then four, doubling each RTT]
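The doubling can be sketched in Python (a sketch; it assumes every in-flight segment is ACKed each RTT, which is what makes per-ACK increments double the window):

```python
# Sketch of slow start: a window of N segments returns N ACKs per RTT,
# and +1 MSS per ACK means cwnd doubles every RTT.

def slow_start(rtts, cwnd=1):
    trace = [cwnd]                 # cwnd in MSS units
    for _ in range(rtts):
        cwnd += cwnd               # one increment per ACK received
        trace.append(cwnd)
    return trace

trace = slow_start(4)              # 1, 2, 4, 8, 16 segments per RTT
```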

TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Switching from Slow Start to Congestion Avoidance (CA):
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP congestion control FSM (slow start / congestion avoidance / fast recovery):

slow start (initially: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0)
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

congestion avoidance
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start

TCP Congestion Control
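The two loss reactions can be sketched in Python (a sketch of just the window/ssthresh updates, in MSS units; the function name and the example window of 40 are illustrative):

```python
# Sketch of loss reactions: timeout (or Tahoe, always) restarts slow
# start from 1 MSS; Reno's triple-dup-ACK path enters fast recovery at
# ssthresh + 3 (the 3 dup ACKs each inflate the window by one segment).

def on_loss(cwnd, kind, flavor="reno"):
    ssthresh = cwnd // 2
    if kind == "timeout" or flavor == "tahoe":
        return 1, ssthresh            # slow start again
    return ssthresh + 3, ssthresh     # fast recovery after 3 dup ACKs
```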

TCP Throughput
• avg TCP thruput as function of window size, RTT (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs
  – window oscillates between W/2 and W, so avg window size (# in-flight bytes) is 3/4 W
  – avg thruput is 3/4 W per RTT: avg TCP thruput = (3/4) · W / RTT bytes/sec
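The average can be sketched in Python (a sketch; the example window of 80 KB at loss and 100 ms RTT are assumed values for illustration):

```python
# Sketch of the steady-state average: the saw tooth runs between W/2
# and W, so the mean window is 3W/4 and throughput is 3W/(4*RTT).

def avg_throughput(W_bytes, rtt_s):
    return 0.75 * W_bytes / rtt_s      # bytes per second

# W = 80 KB at loss, RTT = 100 ms -> about 600 KB/s on average
rate = avg_throughput(80_000, 0.1)
```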

TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]: TCP throughput = 1.22 · MSS / (RTT · √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10: a very small loss rate!
• new versions of TCP for high-speed
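Both numbers on this slide fall out of the formula; a Python check (a sketch; it rearranges the Mathis formula for L and works in bits to match the 10 Gbps target):

```python
# Sketch: required window W = target * RTT / MSS, and rearranging
# throughput = 1.22*MSS/(RTT*sqrt(L)) gives L = (1.22*MSS/(RTT*T))^2.

MSS_BITS = 1500 * 8                # 1500-byte segments
RTT = 0.1                          # 100 ms
TARGET = 10e9                      # 10 Gbps

W = TARGET * RTT / MSS_BITS                    # in-flight segments needed
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2    # loss rate sustaining TARGET
# W is about 83,333 segments and L about 2e-10, matching the slide
```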

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse: What happens when TCP is "too friendly"?
E.g., test on an Ethernet-based storage cluster:
• Client performs synchronized reads
• Increase # of servers involved in transfer (SRU size is fixed)
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: client connected through a switch to storage servers 1–4; a data block is striped into Server Request Units (SRUs) 1–4; after a synchronized read completes, the client sends the next batch of requests]

Link idle time due to timeouts

[Figure: synchronized read of SRUs 1–4; server 4's response is lost, so the link is idle until that server experiences a timeout]

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers grows]
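The collapse arithmetic can be sketched in Python (a sketch; the 200 ms RTO and 1 ms batch transfer time are illustrative assumptions in the spirit of the barrier-synchronized workload above):

```python
# Sketch of why a single timeout collapses goodput: the synchronized
# read cannot proceed until the last SRU arrives, so one lost response
# idles the link for a full RTO while the transfer itself takes ~1 ms.

def goodput_fraction(transfer_s, rto_s, timeouts_per_batch):
    busy = transfer_s
    idle = rto_s * timeouts_per_batch
    return busy / (busy + idle)

no_loss = goodput_fraction(0.001, 0.200, 0)      # full link utilization
one_timeout = goodput_fraction(0.001, 0.200, 1)  # link mostly idle
```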

TCP data-driven loss recovery

[Figure: sender transmits Seq 1–5; packet 2 is probably lost; receiver sends Ack 1 four times (3 duplicate ACKs for 1); sender retransmits packet 2 immediately; receiver then sends Ack 5]

In SANs, recovery in µsecs after loss.

TCP timeout-driven loss recovery

[Figure: sender transmits Seq 1–5; only packet 1 is ACKed; sender waits a full Retransmission Timeout (RTO) before retransmitting]

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

[Figure: side by side: data-driven recovery (duplicate ACKs trigger retransmit of 2, then Ack 5) vs. timeout-driven recovery (sender idles for a full RTO)]

Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs.

TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective

• principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule


Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

source port dest port

32 bits

applicationdata (payload)

UDP segment format

length checksum

length in bytes of UDP segment

including header

no connection establishment (which can add delay)

simple no connection state at sender receiver

small header size no congestion control UDP

can blast away as fast as desired

why is there a UDP

UDP Connectionless TransportUDP Segment Header

UDP Connectionless Transport

senderbull treat segment contents

including header fields as sequence of 16-bit integers

bull checksum addition (onersquos complement sum) of segment contents

bull sender puts checksum value into UDP checksum field

receiverbull compute checksum of received

segmentbull check if computed checksum

equals checksum field value

ndash NO - error detectedndash YES - no error detected

But maybe errors nonetheless More later hellip

Goal detect ldquoerrorsrdquo (eg flipped bits) in transmitted segment

UDP Checksum

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

important in application transport link layers top-10 list of important networking topics

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

sendside

receiveside

rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer

udt_send() called by rdtto transfer packet over unreliable channel to receiver

rdt_rcv() called when packet arrives on rcv-side of channel

deliver_data() called by rdt to deliver data to upper

Principles of Reliable Transport

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing

to establish connection)bull agree on connection parameters

connection state ESTABconnection variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

connection state ESTABconnection Variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

Socket clientSocket = newSocket(hostnameport

number)

Socket connectionSocket = welcomeSocketaccept()

Connection Management TCP 3-way handshake

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

Principles of Congestion Control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP Congestion Control: TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

AIMD: additive increase, multiplicative decrease

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) vs. time: the AIMD "saw tooth" behavior, probing for bandwidth; additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control: TCP Fairness, Why is TCP Fair?
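The saw tooth above is easy to reproduce numerically. A toy trace, assuming (purely for illustration) that a loss occurs every 8th RTT:

```python
def aimd(rtts, mss=1, loss_every=8, cwnd=1):
    """Trace cwnd (in units of MSS) under additive increase, multiplicative decrease."""
    trace = []
    for rtt in range(rtts):
        if rtt > 0 and rtt % loss_every == 0:  # modeled periodic loss event
            cwnd = max(cwnd / 2, 1)            # multiplicative decrease: halve
        else:
            cwnd += mss                        # additive increase: +1 MSS per RTT
        trace.append(cwnd)
    return trace

trace = aimd(12)
print(trace)  # [2, 3, 4, 5, 6, 7, 8, 9, 4.5, 5.5, 6.5, 7.5]
```

The linear climb followed by a halving is exactly the saw tooth in the figure; averaged over many cycles it is what makes competing AIMD flows converge toward equal shares.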

TCP Congestion Control

• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:

  rate ≈ cwnd / RTT  bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent]

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, each axis from 0 to R; the trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2), converging toward the equal bandwidth share line]

TCP Congestion Control: TCP Fairness, Why is TCP Fair?

TCP Congestion Control: TCP Fairness

Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

TCP Congestion Control: Slow Start

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments, then four segments, each round one RTT apart]

TCP Congestion Control: Detecting and Reacting to Loss

• loss indicated by timeout:
  – cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

[FSM: TCP congestion control (Reno)]

slow start:
• Λ (initial): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: → congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment → fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment → slow start
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment → fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0 → congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment → slow start

TCP Congestion Control
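The FSM above can be sketched as event handlers on a sender object. This is a compact illustration of the state transitions only (no actual transmission or timer logic), using the slide's variable names:

```python
MSS = 1  # count cwnd in units of MSS for readability

class RenoSender:
    """Sketch of the Reno FSM: slow start (SS), congestion avoidance (CA),
    fast recovery (FR)."""

    def __init__(self):
        self.cwnd, self.ssthresh, self.dup, self.state = MSS, 64, 0, "SS"

    def new_ack(self):
        if self.state == "FR":
            self.cwnd, self.state = self.ssthresh, "CA"  # deflate the window
        elif self.state == "SS":
            self.cwnd += MSS                             # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "CA"
        else:                                            # CA: linear growth
            self.cwnd += MSS * MSS / self.cwnd
        self.dup = 0

    def dup_ack(self):
        if self.state == "FR":
            self.cwnd += MSS                             # window inflation
            return
        self.dup += 1
        if self.dup == 3:                                # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "FR"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd, self.dup, self.state = MSS, 0, "SS"

s = RenoSender()
for _ in range(4):
    s.new_ack()           # slow start: cwnd 1 -> 5
for _ in range(3):
    s.dup_ack()           # triple dup ACK -> fast recovery
s.new_ack()               # new ACK deflates cwnd to ssthresh -> CA
```

Note how each loss signal is punished differently: a timeout restarts from 1 MSS in slow start, while triple duplicate ACKs only halve the window, matching the Reno/Tahoe distinction on the previous slide.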

TCP Throughput

• avg TCP thruput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is ¾ W
  – avg thruput is ¾ W per RTT:

    avg TCP thruput = (3/4) · W / RTT  bytes/sec

[Figure: cwnd saw tooth oscillating between W/2 and W]

TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 · MSS / (RTT · √L)

→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP for high-speed
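Both numbers on this slide fall out of the formulas directly. A quick check (MSS converted to bits; `w` is the window needed to fill the pipe at rate · RTT / MSS):

```python
from math import sqrt

mss = 1500 * 8          # bits per segment
rtt = 0.100             # seconds
target = 10e9           # 10 Gbps

# Window that keeps the pipe full: rate = W * MSS / RTT  =>  W = rate * RTT / MSS
w = target * rtt / mss

# Mathis et al. model: rate = 1.22 * MSS / (RTT * sqrt(L))  =>  solve for L
loss = (1.22 * mss / (rtt * target)) ** 2

print(round(w), loss)   # 83333 segments in flight, L ~ 2.1e-10
```

The striking part is the second line: sustaining 10 Gbps with classic AIMD tolerates roughly one loss per five billion segments, which is the motivation for the high-speed TCP variants mentioned above.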

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a client connected through a switch to storage servers 1–4; in a synchronized read, the client requests a data block, each server returns its share, the Server Request Unit (SRU), and only after receiving all of 1, 2, 3, 4 does the client send the next batch of requests]

Link idle time due to timeouts

[Figure: the same synchronized read, but server 4's Server Request Unit (SRU) is lost; the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers involved in the transfer increases]

TCP data-driven loss recovery

[Figure: sender transmits Seq # 1–5; packet 2 is probably lost; the receiver answers each later segment with Ack 1; after 3 duplicate ACKs for 1, the sender retransmits packet 2 immediately, and the receiver then returns Ack 5]

In SANs: recovery in usecs after loss.

TCP timeout-driven loss recovery

[Figure: sender transmits Seq # 1–5; only Ack 1 returns; the sender must wait out the full Retransmission Timeout (RTO) before resending]

• Timeouts are expensive (msecs to recover after loss)

TCP loss recovery comparison

[Figure: side by side: data-driven recovery (3 duplicate Ack 1s trigger an immediate retransmit of 2, answered by Ack 5) vs. timeout-driven recovery (sender idles for the full Retransmission Timeout, RTO)]

• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (us) in SANs

TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective

• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time

UDP Connectionless Transport: UDP Segment Header

[Figure: UDP segment format (32 bits wide): source port #, dest port #; length, checksum; application data (payload). Length is in bytes of UDP segment, including header.]

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP Connectionless Transport: UDP Checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …
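The one's complement arithmetic above is short enough to write out. This sketch shows only the 16-bit sum with end-around carry, not the full UDP computation (which also covers an IP pseudo-header):

```python
def ones_complement_checksum(data: bytes) -> int:
    """16-bit one's complement sum over 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold any carry back in
    return ~total & 0xFFFF

cksum = ones_complement_checksum(b"\x12\x34\x56\x78")
print(hex(cksum))  # 0x9753
```

The receiver's check follows: adding all data words plus the checksum (with the same carry folding) must yield 0xFFFF; any single flipped bit breaks that.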

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Reliable Transport

• important in application, transport, link layers
  – top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

[Figure: rdt interfaces, send side and receive side]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP Reliable Transport

TCP Reliable Transport: TCP Segment Structure

[Figure: TCP segment structure (32 bits wide): source port #, dest port #; sequence number; acknowledgement number (counting is by bytes of data, not segments!); head len, unused bits, and flags: URG (urgent data, generally not used), ACK (ACK # valid), PSH (push data now, generally not used), RST, SYN, FIN (connection estab: setup, teardown commands); receive window (# bytes rcvr willing to accept); checksum (Internet checksum, as in UDP); Urg data pointer; options (variable length); application data (variable length)]

TCP Reliable Transport: TCP Sequence numbers and Acks

sequence numbers:
– byte stream "number" of first byte in segment's data

acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK

Q: how does the receiver handle out-of-order segments?
A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the sender sequence number space is partitioned into: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable, with window size N spanning the middle regions]

[Figure: simple telnet scenario: User types 'C'; Host A sends Seq=42, ACK=79, data = 'C'; Host B ACKs receipt of 'C' and echoes back 'C' (Seq=79, ACK=43, data = 'C'); Host A ACKs receipt of echoed 'C' (Seq=43, ACK=80)]

TCP Reliable Transport: TCP Sequence numbers and Acks


TCP Reliable Transport: Connection Management, TCP 3-way handshake

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters

[Figure: client and server each reach connection state ESTAB with connection variables: seq #s (client-to-server, server-to-client); rcvBuffer size at server, client]

  Socket clientSocket = new Socket("hostname", "port number");
  Socket connectionSocket = welcomeSocket.accept();

[Figure: TCP 3-way handshake. Client (state LISTEN): choose init seq num x, send TCP SYN msg (SYNbit=1, Seq=x), enter SYNSENT. Server (state LISTEN): choose init seq num y, send TCP SYNACK msg acking the SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1), enter SYN RCVD. Client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data), enter ESTAB. Server: received ACK(y) indicates client is live, enter ESTAB.]

TCP Reliable Transport: Connection Management, TCP 3-way handshake

[FSM: 3-way handshake. Client: closed → (Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)) → SYNsent → (rcv SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)) → ESTAB. Server: listen (Socket connectionSocket = welcomeSocket.accept()) → (rcv SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client) → SYNrcvd → (rcv ACK(ACKnum=y+1)) → ESTAB.]

TCP Reliable Transport: Connection Management, TCP 3-way handshake
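The handshake in the FSM is exactly what `connect()` and `accept()` perform underneath. A minimal loopback sketch in Python (port 0 asks the OS for any free port; the echo logic is illustrative):

```python
import socket
import threading

def echo_server(welcome):
    conn, addr = welcome.accept()       # accept() completes the 3-way handshake
    data = conn.recv(1024)
    conn.sendall(data.upper())          # echo back, uppercased
    conn.close()

welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
welcome.listen(1)
t = threading.Thread(target=echo_server, args=(welcome,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(5)
client.connect(welcome.getsockname())   # sends SYN, waits for SYNACK, sends ACK
client.sendall(b"c")
reply = client.recv(1024)               # b"C"
client.close()
t.join()
welcome.close()
```

By the time `connect()` returns on the client and `accept()` returns on the server, both ends are in ESTAB with initial sequence numbers agreed, which is why application code never sees the SYN/SYNACK/ACK exchange itself.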

TCP Reliable Transport: Connection Management, Closing a connection

client, server each close their side of connection:
• send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: connection teardown. Client (ESTAB): clientSocket.close(), send FINbit=1, seq=x → FIN_WAIT_1 (can no longer send, but can receive data). Server (ESTAB): rcv FIN, send ACKbit=1, ACKnum=x+1 → CLOSE_WAIT (can still send data). Client → FIN_WAIT_2 (wait for server close). Server: send FINbit=1, seq=y → LAST_ACK (can no longer send data). Client: rcv FIN, send ACKbit=1, ACKnum=y+1 → TIMED_WAIT (timed wait for 2 × max segment lifetime) → CLOSED. Server: rcv ACK → CLOSED.]

TCP Reliable Transport: Connection Management, Closing a connection


TCP Reliable Transport: sender events

data rcvd from app:
• create segment with seq #
  – seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments:
  – update what is known to be ACKed
  – start timer if there are still unacked segments

[Figure: TCP retransmission scenarios. Lost ACK scenario: Host A sends Seq=92, 8 bytes of data; ACK=100 is lost (X); the timeout fires and Host A resends Seq=92, 8 bytes of data; Host B re-ACKs with ACK=100. Premature timeout: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 and ACK=120 are on their way, but the timer for Seq=92 expires first, so Host A resends Seq=92, 8 bytes; Host B replies with cumulative ACK=120; SendBase advances 92 → 100 → 120.]

TCP Reliable Transport: TCP Retransmission Scenarios

[Figure: cumulative ACK scenario: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X), but ACK=120 arrives before the timeout, so Host A does not retransmit and continues with Seq=120, 15 bytes of data.]

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, RFC 2581], Reliable Transport

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

TCP Reliable Transport: TCP Fast Retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment was lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92, 8 bytes, then Seq=100, 20 bytes; the Seq=100 segment is lost (X); Host B returns ACK=100 four times; on the third duplicate ACK, Host A resends Seq=100, 20 bytes, well before the timeout would expire.]

TCP Reliable Transport: TCP Fast Retransmit
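The trigger condition is just a counter on repeated ACK values. A toy sender-side trace (names and the ACK stream are illustrative, not from any real stack):

```python
def fast_retransmit_trace(acks, threshold=3):
    """Return the seq #s a sender would fast-retransmit, given a stream of ACKs."""
    retransmits = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:   # triple duplicate ACK
                retransmits.append(ack)  # resend segment starting at this seq #
        else:
            last_ack, dup_count = ack, 0
    return retransmits

# ACK=100 arrives, then three duplicates of it -> retransmit segment 100
print(fast_retransmit_trace([100, 100, 100, 100, 120]))  # [100]
```

This is the "data-driven" recovery contrasted with timeouts later in the lecture: the loss is inferred from the ACK stream in about one RTT instead of waiting out the RTO.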

TCP Reliable Transport: TCP Round-trip time and timeouts

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

  EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT (milliseconds, roughly 100–350 ms) vs. time (seconds), for RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; EstimatedRTT tracks SampleRTT but is smoother]

TCP Reliable Transport: TCP Round-trip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)

  TimeoutInterval = EstimatedRTT + 4 · DevRTT
  (estimated RTT) + ("safety margin")

TCP Reliable Transport: TCP Round-trip time and timeouts
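The two EWMAs combine into the timeout as a three-line update. A sketch with the slide's gains (the update order and the starting values are assumptions here, following common practice such as RFC 6298; times in ms):

```python
ALPHA, BETA = 0.125, 0.25  # the slide's typical gains

def update_rto(est_rtt, dev_rtt, sample_rtt):
    """One EstimatedRTT / DevRTT / TimeoutInterval update step."""
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt
    return est_rtt, dev_rtt, est_rtt + 4 * dev_rtt

est, dev = 100.0, 0.0              # assumed starting values (ms)
for s in [110, 95, 105, 180]:      # the spike at the end widens the margin
    est, dev, rto = update_rto(est, dev, s)
print(round(est, 1), round(dev, 1), round(rto, 1))
```

Notice that DevRTT reacts to the 180 ms spike far more than EstimatedRTT does, so one outlier stretches the safety margin (and hence TimeoutInterval) without dragging the smoothed RTT estimate very far.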

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time

UDP Connectionless Transport: UDP Checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment
sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …
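The one's complement sum above can be made concrete. A minimal sketch (illustrative, not the deck's code) of a UDP-style 16-bit checksum and the receiver's check:

```python
# One's-complement 16-bit checksum, UDP style (illustrative sketch only; real
# UDP also covers a pseudo-header with the IP addresses and protocol number).
def ones_complement_checksum(data: bytes) -> int:
    if len(data) % 2:                 # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):  # treat contents as 16-bit big-endian words
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # wrap the carry back in
    return ~total & 0xFFFF            # one's complement of the sum

def verify(data: bytes, checksum: int) -> bool:
    # Receiver check: data plus its checksum must sum to 0xFFFF (all ones).
    if len(data) % 2:
        data += b"\x00"
    total = checksum
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF

csum = ones_complement_checksum(b"hello")
```

As the slide warns, a passing checksum does not prove the absence of errors: two flipped bits can cancel each other out.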

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Reliable Transport

• important in application, transport, and link layers: top-10 list of important networking topics
• characteristics of the unreliable channel will determine the complexity of a reliable data transfer protocol (rdt)

Principles of Reliable Transport

send side / receive side interfaces:
• rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
• udt_send(): called by rdt to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

TCP Reliable Transport

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport: TCP Segment Structure

[32-bit-wide header diagram: source port #, dest port #; sequence number; acknowledgement number; head len, unused bits, flag bits U A P R S F; receive window; checksum; urg data pointer; options (variable length); application data (variable length).]
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection estab (setup, teardown commands)

TCP Reliable Transport: TCP Sequence numbers and Acks

sequence numbers:
– byte stream "number" of first byte in segment's data
acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK
Q: how does the receiver handle out-of-order segments? A: TCP spec doesn't say; up to implementor
[Diagram: sender sequence number space divides into sent & ACKed / sent, not-yet ACKed ("in-flight") / usable but not yet sent / not usable, under a window of size N; incoming and outgoing segments carry source/dest ports, sequence number, acknowledgement number, rwnd, checksum, urg pointer.]
simple telnet scenario (Host A, Host B):
• user types 'C': A sends Seq=42, ACK=79, data = 'C'
• host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
• host A ACKs receipt of echoed 'C': Seq=43, ACK=80
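The byte-counting rule above (an ACK names the next byte expected) is just arithmetic; the telnet numbers fall out directly. A tiny sketch with hypothetical helper names:

```python
# Seq/ack arithmetic from the slides: sequence numbers count bytes (not
# segments), and an ACK carries the next byte expected from the other side.
def next_seq(seq: int, data_len: int) -> int:
    return seq + data_len          # our next segment's sequence number

def ack_for(peer_seq: int, data_len: int) -> int:
    return peer_seq + data_len     # next byte we expect from the peer

# Telnet echo scenario: A sends 'C' at Seq=42; B echoes at Seq=79, ACK=43;
# A then acknowledges with ACK=80.
```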

TCP Reliable Transport

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport: Connection Management, TCP 3-way handshake

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters
connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client
client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP Reliable Transport: Connection Management, TCP 3-way handshake

client (LISTEN): choose init seq num x; send TCP SYN msg (SYNbit=1, Seq=x); state → SYNSENT
server (LISTEN): on SYN(x), choose init seq num y; create new socket for communication back to client; send TCP SYNACK msg acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); state → SYN RCVD
client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; state → ESTAB
server: received ACK(y) indicates client is live; state → ESTAB
[FSM: closed → (clientSocket = new Socket(...)) send SYN(seq=x) → SYNsent; listen → on SYN(x) send SYNACK(seq=y, ACKnum=x+1) → SYNrcvd; client replies ACK(ACKnum=y+1) → ESTAB on both sides.]
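From the application's point of view, the handshake above is hidden inside the socket calls on the slide: it completes inside connect()/accept(). A minimal runnable sketch (loopback address and OS-chosen port are assumptions for illustration):

```python
import socket
import threading

# The three-way handshake happens inside connect()/accept(); the application
# never sees the SYN / SYNACK / ACK segments, only the established connection.
welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
welcome.listen(1)
port = welcome.getsockname()[1]

def server():
    conn, _ = welcome.accept()       # returns once the handshake completes
    conn.sendall(conn.recv(16))      # echo one small message back
    conn.close()

t = threading.Thread(target=server)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))  # SYN, SYNACK, ACK exchanged here
client.sendall(b"C")
echoed = client.recv(16)
client.close()
t.join()
welcome.close()
```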

TCP Reliable Transport: Connection Management, Closing connection

• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled
[Teardown diagram: client (ESTAB) calls clientSocket.close() and sends FINbit=1, seq=x, entering FIN_WAIT_1 (can no longer send, but can receive data); server ACKs (ACKbit=1, ACKnum=x+1) and enters CLOSE_WAIT (can still send data); client moves to FIN_WAIT_2 (wait for server close); server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data); client ACKs (ACKbit=1, ACKnum=y+1) and enters TIMED_WAIT (timed wait for 2 × max segment lifetime) before CLOSED; server enters CLOSED.]

TCP Reliable Transport

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport: sender events

data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout; restart timer
ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments
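The three sender events above can be captured in a small state sketch (illustrative, not a real TCP stack): one retransmission timer, conceptually attached to the oldest unacked segment, restarted on timeout and stopped once everything is ACKed.

```python
# Sketch of the three TCP sender events: data-from-app, timeout, ack-received.
class SenderSketch:
    def __init__(self, timeout_interval):
        self.next_seq = 0       # byte-stream number of the next new byte
        self.send_base = 0      # oldest unacked byte
        self.timer = None       # expiry time of the single timer (None = stopped)
        self.rto = timeout_interval
        self.segments = {}      # seq -> data, kept for retransmission

    def on_app_data(self, data, now):
        self.segments[self.next_seq] = data      # segment carries seq# of first byte
        if self.timer is None:                   # start timer if not already running
            self.timer = now + self.rto
        self.next_seq += len(data)

    def on_timeout(self, now):
        self.timer = now + self.rto              # restart timer
        return self.segments[self.send_base]     # retransmit oldest unacked segment

    def on_ack(self, acknum, now):
        if acknum > self.send_base:              # acks previously unacked segments
            for seq in [s for s in self.segments if s < acknum]:
                del self.segments[seq]
            self.send_base = acknum
            self.timer = (now + self.rto) if self.segments else None

s = SenderSketch(timeout_interval=1.0)
s.on_app_data(b"12345678", now=0.0)   # bytes 0..7
s.on_app_data(b"abcd", now=0.1)       # bytes 8..11
resent = s.on_timeout(now=1.0)        # timer fires: oldest segment is resent
s.on_ack(8, now=1.2)                  # first segment ACKed, timer restarts
base_after, timer_running = s.send_base, s.timer is not None
s.on_ack(12, now=1.3)                 # all data ACKed, timer stops
```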

TCP Reliable Transport: TCP Retransmission Scenarios

[lost ACK scenario: Host A sends Seq=92, 8 bytes of data; the ACK=100 is lost (X); after a timeout, A resends Seq=92, 8 bytes of data and receives ACK=100; SendBase advances 92 → 100.]
[premature timeout: A sends Seq=92, 8 bytes and Seq=100, 20 bytes of data; ACK=100 arrives, but the timer fires first and A resends Seq=92, 8 bytes; the cumulative ACK=120 then acknowledges everything; SendBase advances 100 → 120.]
[cumulative ACK: A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost (X), but ACK=120 arrives and cumulatively acknowledges both, so A simply continues with Seq=120, 15 bytes of data.]

TCP ACK generation [RFC 1122, RFC 2581], Reliable Transport

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq # (gap detected) → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

TCP Reliable Transport: TCP Fast Retransmit

• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
• TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #; likely that the unacked segment was lost, so don't wait for timeout
[Fast retransmit scenario: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes of data; the latter is lost (X). Host B returns four ACK=100s; on receipt of the triple duplicate ACK, A resends Seq=100, 20 bytes of data before its timeout expires.]
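The triple-duplicate-ACK rule is easy to state as code. A sketch (hypothetical helper, not the deck's code) that scans a stream of ACK numbers and reports where fast retransmit would fire:

```python
# Fast-retransmit rule: the 3rd duplicate ACK (4th ACK for the same data)
# triggers an immediate resend of the segment starting at that ACK number.
def ack_stream_to_actions(acks):
    actions, last_ack, dup_count = [], None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:                       # "triple duplicate ACKs"
                actions.append(("retransmit", ack))  # resend segment at seq=ack
        else:
            last_ack, dup_count = ack, 0
    return actions

actions = ack_stream_to_actions([100, 100, 100, 100, 120])
```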

TCP Reliable Transport: TCP Roundtrip time and timeouts

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Roundtrip time and timeouts

[Chart: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100-350) vs time (seconds); SampleRTT fluctuates around the smoother EstimatedRTT.]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average; influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Roundtrip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|   (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT   (estimated RTT plus "safety margin")
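The two EWMAs above combine into the timeout computation. A sketch with the slides' typical values (α = 0.125, β = 0.25); initializing the deviation to half the first sample is a common convention, not something the slides specify:

```python
# Sketch of TCP's RTT estimation: an EWMA of samples plus a deviation-based
# safety margin, per the EstimatedRTT / DevRTT / TimeoutInterval formulas.
def rtt_estimator(samples, alpha=0.125, beta=0.25):
    est, dev = samples[0], samples[0] / 2   # assumed initialization
    for s in samples[1:]:
        dev = (1 - beta) * dev + beta * abs(s - est)   # DevRTT
        est = (1 - alpha) * est + alpha * s            # EstimatedRTT
    return est, est + 4 * dev                          # (est, TimeoutInterval)

est, timeout = rtt_estimator([0.100, 0.120, 0.090, 0.110, 0.100])
```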

TCP Reliable Transport

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport: Flow Control

[Receiver protocol stack: from sender → IP code → TCP code → TCP socket receiver buffers → application process; the application may remove data from TCP socket buffers slower than the TCP receiver is delivering (sender is sending).]
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive, data drains to the application process.]
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
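The sender-side rule is essentially a one-liner: new data may be sent only while in-flight bytes stay within rwnd. A sketch with a hypothetical helper name:

```python
# Flow control at the sender: unacked ("in-flight") bytes must not exceed the
# receiver-advertised window rwnd; the rest of the window is what we may send.
def sendable_bytes(last_byte_sent: int, last_byte_acked: int, rwnd: int) -> int:
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)   # new bytes the sender may transmit now

can_send = sendable_bytes(last_byte_sent=1000, last_byte_acked=600, rwnd=500)
```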

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Congestion Control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP Congestion Control: TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Diagram: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

TCP Congestion Control: Why is TCP fair? AIMD: additive increase, multiplicative decrease

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Plot of TCP sender congestion window size (cwnd) over time: the AIMD "saw tooth" behavior, probing for bandwidth; additively increase window size … until loss occurs (then cut window in half).]
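The saw tooth can be reproduced with a toy loop. Here loss is simulated whenever cwnd exceeds an assumed bottleneck capacity (a stand-in signal); real TCP infers loss from duplicate ACKs or timeouts:

```python
# Toy AIMD loop: +1 MSS per RTT, halve the window on loss.
def aimd(capacity_mss, rtts, cwnd=1.0):
    trace = []
    for _ in range(rtts):
        if cwnd > capacity_mss:    # stand-in for a loss signal
            cwnd = cwnd / 2        # multiplicative decrease
        else:
            cwnd = cwnd + 1        # additive increase: +1 MSS per RTT
        trace.append(cwnd)
    return trace

trace = aimd(capacity_mss=20, rtts=100)   # produces the saw tooth
```

Plotting `trace` reproduces the slide's figure: a climb to the capacity, a halving, and a steady oscillation probing for bandwidth.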

TCP Congestion Control

• sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes: rate ≈ cwnd / RTT bytes/sec
[Sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight region.]

TCP Congestion Control: TCP Fairness. Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Phase plot: Connection 1 throughput vs Connection 2 throughput, each bounded by R; alternating congestion avoidance (additive increase) and loss (decrease window by factor of 2) moves the operating point toward the equal bandwidth share line.]

TCP Congestion Control: TCP Fairness

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts; web browsers do this
• e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2

TCP Congestion Control: Slow Start

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast
[Diagram: Host A sends one segment, then two, then four per RTT to Host B.]

TCP Congestion Control: Detecting and Reacting to Loss

• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control

[TCP congestion-control FSM, with states slow start, congestion avoidance, and fast recovery:
Λ (start): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0 → slow start.
slow start: on new ACK, cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; on duplicate ACK, dupACKcount++; when cwnd > ssthresh → congestion avoidance.
congestion avoidance: on new ACK, cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed; on duplicate ACK, dupACKcount++.
from either state: on timeout, ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment → slow start; when dupACKcount == 3, ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment → fast recovery.
fast recovery: on duplicate ACK, cwnd = cwnd + MSS, transmit new segment(s) as allowed; on new ACK, cwnd = ssthresh, dupACKcount = 0 → congestion avoidance; on timeout, ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment → slow start.]
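The FSM can be transcribed almost mechanically. A compact sketch of the Reno states and transitions (MSS-granularity units, retransmissions elided; illustrative, not a real stack):

```python
# Compact Reno congestion-control FSM: slow start, congestion avoidance,
# fast recovery. cwnd/ssthresh are in MSS units for readability.
class RenoSketch:
    def __init__(self, mss=1):
        self.mss, self.cwnd, self.ssthresh = mss, mss, 64
        self.state, self.dup = "slow_start", 0

    def on_new_ack(self):
        self.dup = 0
        if self.state == "fast_recovery":
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                 # exponential growth per ACK
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # ~ +1 MSS per RTT

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss                 # window inflation
            return
        self.dup += 1
        if self.dup == 3:                         # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"          # (and retransmit missing segment)

    def on_timeout(self):
        self.ssthresh, self.cwnd = self.cwnd / 2, self.mss
        self.state, self.dup = "slow_start", 0    # (and retransmit missing segment)

r = RenoSketch()
for _ in range(10):
    r.on_new_ack()                  # slow start: cwnd 1 -> 11
for _ in range(3):
    r.on_dup_ack()                  # 3rd dup ACK enters fast recovery
state_after_dups, ssthresh_after = r.state, r.ssthresh
r.on_new_ack()                      # new ACK deflates cwnd to ssthresh
cwnd_after, state_after = r.cwnd, r.state
r.on_timeout()                      # timeout resets to slow start
```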

TCP Throughput

• avg TCP throughput as function of window size, RTT? (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT: avg TCP thruput = (3/4) · W / RTT bytes/sec
[Sawtooth: the window oscillates between W/2 and W.]

TCP over "long, fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = (1.22 · MSS) / (RTT · √L)

→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP needed for high-speed links
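The slide's two numbers (W = 83,333 in-flight segments; L = 2·10^-10) follow directly from the Mathis formula, throughput = 1.22·MSS/(RTT·√L). A quick check, taking MSS in bits so the target rate can be stated in bits/sec:

```python
# Verify the long-fat-pipe numbers: required window and tolerable loss rate.
mss_bits = 1500 * 8        # 1500-byte segments
rtt = 0.100                # 100 ms
target = 10e9              # 10 Gbps

w = target * rtt / mss_bits                       # in-flight segments needed
loss = (1.22 * mss_bits / (rtt * target)) ** 2    # Mathis formula solved for L
```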

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer (SRU size is fixed)
• TCP used as the data transfer protocol

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

Cluster-based Storage Systems

[Diagram: a client connects through a switch to storage servers 1-4; a data block is striped across the servers in Server Request Units (SRUs); the client issues a synchronized read for SRUs 1, 2, 3, 4, and only after all four arrive does it send the next batch of requests.]

Link idle time due to timeouts

[Diagram: the same synchronized read of SRUs 1-4, but server 4's response is lost; the client's link is idle until server 4 experiences a timeout.]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Graph: goodput collapses as the number of servers grows.]

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 12: Transport Layer and Data Center TCP

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

bull Congestion control

bull Data Center TCPndash Incast Problem

important in application transport link layers top-10 list of important networking topics

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

sendside

receiveside

rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer

udt_send() called by rdtto transfer packet over unreliable channel to receiver

rdt_rcv() called when packet arrives on rcv-side of channel

deliver_data() called by rdt to deliver data to upper

Principles of Reliable Transport

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing

to establish connection)bull agree on connection parameters

connection state ESTABconnection variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

connection state ESTABconnection Variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

Socket clientSocket = newSocket(hostnameport

number)

Socket connectionSocket = welcomeSocketaccept()

Connection Management TCP 3-way handshake

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
– think of timer as for oldest unacked segment
– expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments:
– update what is known to be ACKed
– start timer if there are still unacked segments

TCP Reliable Transport

lost ACK scenario (Host A, Host B):
• A sends "Seq=92, 8 bytes of data"; B's ACK=100 is lost (X)
• timeout at A: retransmit "Seq=92, 8 bytes of data"; B re-sends ACK=100; SendBase=100

premature timeout:
• A sends "Seq=92, 8 bytes" (SendBase=92), then "Seq=100, 20 bytes"
• A's timeout fires before ACK=100 arrives: retransmit "Seq=92, 8 bytes"
• B replies with cumulative ACK=120; SendBase=120

TCP Reliable Transport: TCP Retransmission Scenarios

cumulative ACK scenario (Host A, Host B):
• A sends "Seq=92, 8 bytes" and "Seq=100, 20 bytes"; ACK=100 is lost (X)
• B's cumulative ACK=120 arrives before timeout, covering both segments
• A continues with "Seq=120, 15 bytes"; no retransmission needed

TCP Reliable Transport: TCP Retransmission Scenarios

event at receiver → TCP receiver action:

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

Reliable Transport: TCP ACK generation [RFC 1122, RFC 2581]
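The first three rows of the table can be sketched as a small receiver-side policy function (a simplified sketch; the gap-filling case and the 500 ms timer itself are omitted, and the state names are made up here):

```python
def on_segment(state, seq, length):
    """ACK policy from the table above (sketch: first three rows only)."""
    if seq == state["expected"]:              # in-order arrival
        state["expected"] += length
        if state["ack_pending"]:              # a second in-order segment waiting
            state["ack_pending"] = False
            return ("cumulative ACK", state["expected"])
        state["ack_pending"] = True           # start delayed-ACK timer (<= 500 ms)
        return ("delay", None)
    # out-of-order, higher-than-expected seq: gap -> immediate duplicate ACK
    return ("dup ACK", state["expected"])

state = {"expected": 100, "ack_pending": False}
print(on_segment(state, 100, 8))   # ('delay', None)
print(on_segment(state, 108, 8))   # ('cumulative ACK', 116)
print(on_segment(state, 200, 8))   # ('dup ACK', 116)
```

The duplicate ACK always carries the seq # of the next expected byte, which is what the sender counts for fast retransmit.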

• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs:
– sender often sends many segments back-to-back
– if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
• A sends "Seq=92, 8 bytes" and "Seq=100, 20 bytes"; Seq=100 is lost (X)
• each later arrival at B triggers ACK=100: A receives four ACK=100s
• on the third duplicate ACK, A retransmits "Seq=100, 20 bytes" before the timeout fires

TCP Reliable Transport: TCP Fast Retransmit
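A sketch of the sender-side bookkeeping (illustrative, not a full TCP implementation; the state dictionary is a stand-in for the sender's connection state):

```python
def on_ack(state, acknum):
    """Fast-retransmit bookkeeping: returns a seq # to retransmit, or None."""
    if acknum > state["send_base"]:         # new ACK: window slides forward
        state["send_base"] = acknum
        state["dup_acks"] = 0
        return None
    state["dup_acks"] += 1                  # duplicate ACK
    if state["dup_acks"] == 3:              # triple duplicate: fast retransmit
        return state["send_base"]           # resend smallest unacked seq #
    return None

state = {"send_base": 100, "dup_acks": 0}
events = [on_ack(state, 100) for _ in range(4)]
print(events)  # [None, None, 100, None]: retransmit fires on the 3rd duplicate
```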

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Roundtrip time and timeouts

[plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; y-axis RTT (milliseconds), 100–350; x-axis time (seconds), samples 1–106; series: SampleRTT and EstimatedRTT]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Roundtrip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus the "safety margin")

TCP Reliable Transport: TCP Roundtrip time and timeouts
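The two EWMA updates and the timeout rule can be checked numerically (a sketch with the gains α = 0.125 and β = 0.25 from above; RTTs in milliseconds):

```python
ALPHA, BETA = 0.125, 0.25   # typical gains from the slides

def update_rto(est, dev, sample):
    """One update of EstimatedRTT, DevRTT, TimeoutInterval (all in ms)."""
    dev = (1 - BETA) * dev + BETA * abs(sample - est)
    est = (1 - ALPHA) * est + ALPHA * sample
    return est, dev, est + 4 * dev   # TimeoutInterval = EstimatedRTT + 4*DevRTT

est, dev = 100.0, 25.0      # start with a large deviation
for _ in range(50):         # feed 50 steady 100 ms samples
    est, dev, rto = update_rto(est, dev, 100.0)
print(round(est), round(dev))  # 100 0  (safety margin shrinks as RTT steadies)
```

With a steady RTT the deviation term decays, so the timeout converges toward the estimated RTT itself; a jittery RTT keeps the 4·DevRTT margin large.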


flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

receiver protocol stack (application / OS): from sender → IP code → TCP code → TCP socket receiver buffers → application process
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

TCP Reliable Transport: Flow Control

receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the network and drain to the application process
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
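A sketch of the sender-side check, LastByteSent − LastByteAcked ≤ rwnd (names and numbers are illustrative):

```python
def sendable(last_byte_sent, last_byte_acked, rwnd, backlog):
    """Bytes the sender may transmit now without overflowing the receiver."""
    in_flight = last_byte_sent - last_byte_acked   # unacked ("in-flight") data
    window_left = max(0, rwnd - in_flight)
    return min(window_left, backlog)               # limited by app data queued

# advertised rwnd = 4096 bytes, 3000 bytes already in flight, 10000 queued
print(sendable(13000, 10000, 4096, 10000))  # 1096
print(sendable(14096, 10000, 4096, 5000))   # 0: window full, sender must wait
```

When the window is full the sender stalls until ACKs slide last_byte_acked forward or the receiver advertises a larger rwnd.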

Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
– Congestion control
• Data Center TCP
– Incast Problem

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[diagram: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

AIMD "saw tooth" behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)
[plot: TCP sender congestion window size (cwnd) vs. time]

TCP Congestion Control, TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease
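One-update-per-RTT AIMD reproduces the sawtooth (a toy simulation; the 12 MSS "pipe capacity" is an arbitrary assumption, not from the slides):

```python
def aimd_rtt(cwnd, loss):
    """One RTT of AIMD: +1 MSS per RTT, halve on loss."""
    return cwnd / 2 if loss else cwnd + 1

cwnd, trace = 8.0, []
for _ in range(12):
    loss = cwnd >= 12          # pretend the path holds at most 12 MSS
    cwnd = aimd_rtt(cwnd, loss)
    trace.append(cwnd)
print(trace)  # sawtooth: climbs to 12, halves to 6, climbs again
```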

sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
• sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes

rate ≈ cwnd / RTT   bytes/sec

TCP Congestion Control

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[diagram: Connection 1 throughput vs. Connection 2 throughput, both axes 0 to R; repeated rounds of congestion avoidance (additive increase) and loss (decrease window by factor of 2) move the pair toward the "equal bandwidth share" line]

TCP Congestion Control, TCP Fairness: Why is TCP Fair?
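Why the diagram converges: additive increase leaves the gap between the two rates unchanged, while each multiplicative decrease halves it. A toy simulation (the capacity of 40 and starting rates are arbitrary assumptions):

```python
def step(x1, x2, capacity):
    """Both flows add 1 per RTT; both halve when the link overloads."""
    x1, x2 = x1 + 1, x2 + 1          # additive increase: gap unchanged
    if x1 + x2 > capacity:           # shared loss event
        x1, x2 = x1 / 2, x2 / 2      # multiplicative decrease: gap halves
    return x1, x2

x1, x2 = 2.0, 30.0                   # start far from the fair share
for _ in range(200):
    x1, x2 = step(x1, x2, 40)
print(round(x2 - x1, 2))  # 0.0: the two rates converge to an equal split
```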

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness
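The R/10 vs. R/2 arithmetic, assuming per-connection fair sharing at the bottleneck:

```python
def new_app_share(existing, new, rate):
    """Per-connection fair share: each of the (existing + new) flows gets rate/n."""
    return rate * new / (existing + new)

R = 1.0
print(new_app_share(9, 1, R))   # 0.1: one new connection gets R/10
print(new_app_share(9, 11, R))  # 0.55: eleven parallel connections grab over R/2
```

TCP fairness is per connection, not per application, which is exactly what parallel connections exploit.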

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[diagram: Host A sends one segment, then two segments, then four segments, one RTT apart]

TCP Congestion Control: Slow Start
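Incrementing cwnd by 1 MSS per ACK is what produces the per-RTT doubling (a toy model assuming one ACK per segment and no delayed ACKs):

```python
def slow_start_rtt(cwnd):
    """One RTT: cwnd segments sent -> cwnd ACKs -> cwnd extra MSS -> doubling."""
    return cwnd + cwnd

cwnd, trace = 1, [1]
for _ in range(4):
    cwnd = slow_start_rtt(cwnd)
    trace.append(cwnd)
print(trace)  # [1, 2, 4, 8, 16]
```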

TCP Congestion Control

loss indicated by timeout:
• cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly

loss indicated by 3 duplicate ACKs (TCP Reno):
• dup ACKs indicate network capable of delivering some segments
• cwnd is cut in half, window then grows linearly

TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)

TCP Congestion Control: Detecting and Reacting to Loss

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

TCP congestion-control FSM (Reno):

slow start:
• entry (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): → congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment → fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment → slow start
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment → fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0 → congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment → slow start

TCP Congestion Control

• avg TCP thruput as function of window size, RTT:
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg window size (# in-flight bytes) is 3/4 W
– avg thruput is 3/4 W per RTT:

avg TCP thruput = (3/4) · W / RTT   bytes/sec

[plot: window oscillates between W/2 and W]

TCP Throughput
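The 3/4 · W / RTT average can be evaluated directly (a sketch; W = 1 MB and RTT = 100 ms are example numbers, not from the slides):

```python
def avg_tcp_thruput_bps(w_bytes, rtt_s):
    """Sawtooth average: window swings between W/2 and W, mean 3/4 W per RTT."""
    return 0.75 * w_bytes * 8 / rtt_s   # convert bytes to bits

print(avg_tcp_thruput_bps(1_000_000, 0.1) / 1e6)  # ~60 Mbps
```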

TCP over "long fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = 1.22 · MSS / (RTT · √L)

→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP for high-speed
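Inverting the Mathis model for the slide's example reproduces the 2·10⁻¹⁰ loss rate:

```python
import math

def mathis_bps(mss_bits, rtt_s, loss):
    """Mathis et al. model: throughput ~ 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * mss_bits / (rtt_s * math.sqrt(loss))

target = 10e9                       # want 10 Gbps
mss_bits, rtt = 1500 * 8, 0.100     # 1500-byte segments, 100 ms RTT
loss = (1.22 * mss_bits / (rtt * target)) ** 2   # solve the model for L
print(loss)                         # ~2e-10, the slide's "very small" loss rate
```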

Goals for Today
• Transport Layer
– Abstraction / services
– Multiplexing / Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport (abstraction, connection management, reliable transport, flow control, timeouts)
– Congestion control
• Data Center TCP
– Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008

TCP Throughput Collapse: what happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
– SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[diagram: Client, Switch, Storage Servers 1–4; a synchronized read requests a data block striped across the servers in Server Request Units (SRUs); once all of 1, 2, 3, 4 arrive, the client sends the next batch of requests]

Link idle time due to timeouts

[diagram: same synchronized read; SRU 4 is dropped at the switch, and the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[plot: goodput collapses as the number of servers grows]

TCP data-driven loss recovery

[diagram: sender sends Seq 1–5; packet 2 is lost; arrivals of 3, 4, 5 each trigger Ack 1, so the sender sees 3 duplicate ACKs for 1 (packet 2 is probably lost) and retransmits packet 2 immediately; receiver then sends Ack 5]

In SANs: recovery in µsecs after loss

TCP timeout-driven loss recovery

[diagram: sender sends Seq 1–5; only packet 1 is ACKed; without enough duplicate ACKs, the sender must wait for the Retransmission Timeout (RTO) before resending]

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

[diagrams: data-driven recovery: triple duplicate Ack 1 triggers "Retransmit 2", then Ack 5; timeout-driven recovery: sender idles for the Retransmission Timeout (RTO)]

• Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs

TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
– Increase buffer size (costly)
– Reduce RTOmin (unsafe)
– Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
– Limit in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications

principles behind transport layer services:
• multiplexing, demultiplexing
• reliable data transfer
• flow control
• congestion control
instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Perspective

Before Next time

• Project Proposal
– due in one week
– Meet with groups, TA, and professor
• Lab1
– Single threaded TCP proxy
– Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
– We will also briefly discuss data center topologies
• Check website for updated schedule



received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

TCP Congestion Control

[FSM: TCP congestion control (Reno), transcribed from the state diagram]
- Initialization: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start.
- Slow start: on new ACK, cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; on duplicate ACK, dupACKcount++; when cwnd > ssthresh, enter congestion avoidance.
- Congestion avoidance: on new ACK, cwnd = cwnd + MSS * (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed; on duplicate ACK, dupACKcount++.
- From slow start or congestion avoidance, on timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit the missing segment; (re)enter slow start.
- From slow start or congestion avoidance, when dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit the missing segment; enter fast recovery.
- Fast recovery: on duplicate ACK, cwnd = cwnd + MSS, transmit new segment(s) as allowed; on timeout, ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit the missing segment and re-enter slow start; on new ACK, cwnd = ssthresh, dupACKcount = 0, enter congestion avoidance.
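The window updates above can be sketched in a few lines. This is a simplified model (cwnd counted in MSS units, hypothetical function names), not a full Reno implementation:

```python
MSS = 1  # count cwnd in units of MSS for clarity

def on_new_ack(cwnd, ssthresh):
    """Slow start below ssthresh; congestion avoidance above it."""
    if cwnd < ssthresh:
        return cwnd + MSS                 # slow start: +1 MSS per ACK (doubles per RTT)
    return cwnd + MSS * (MSS / cwnd)      # congestion avoidance: ~+1 MSS per RTT

def on_timeout(cwnd, ssthresh):
    """Timeout: halve ssthresh, restart from 1 MSS in slow start."""
    return 1 * MSS, cwnd / 2

def on_triple_dup_ack(cwnd, ssthresh):
    """Three duplicate ACKs (Reno): halve ssthresh, enter fast recovery."""
    new_ssthresh = cwnd / 2
    return new_ssthresh + 3 * MSS, new_ssthresh

cwnd, ssthresh = 1, 8
trace = []
for _ in range(16):                       # 16 new ACKs, no loss
    cwnd = on_new_ack(cwnd, ssthresh)
    trace.append(cwnd)

cwnd, ssthresh = on_timeout(cwnd, ssthresh)
```

After seven ACKs the window has climbed from 1 to the ssthresh of 8; growth then flattens to roughly one MSS per RTT, and the final timeout resets cwnd to 1 MSS.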

TCP Throughput

Average TCP throughput as a function of window size and RTT (ignore slow start; assume there is always data to send). Let W be the window size (measured in bytes) at which loss occurs: the average window size (# in-flight bytes) is 3/4 W, so the average throughput is 3/4 W per RTT:

avg TCP throughput = (3/4) * W / RTT bytes/sec

[Figure: sawtooth of the congestion window, oscillating between W/2 and W.]

TCP over "long, fat pipes"

Example: 1500-byte segments, 100 ms RTT; we want 10 Gbps throughput. This requires W = 83,333 in-flight segments. Throughput in terms of the segment loss probability L [Mathis 1997]:

TCP throughput = 1.22 * MSS / (RTT * sqrt(L))

To achieve 10 Gbps throughput, we need a loss rate of L = 2*10^-10, a very small loss rate! Hence new versions of TCP are needed for high-speed networks.
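Plugging the slide's numbers into these two formulas reproduces both figures; a quick check using the Mathis et al. model as stated above:

```python
from math import sqrt

MSS_BITS = 1500 * 8        # 1500-byte segments
RTT = 0.100                # 100 ms
TARGET = 10e9              # 10 Gbps

# Window needed to sustain TARGET: throughput = W * MSS / RTT, so W = TARGET * RTT / MSS.
W = TARGET * RTT / MSS_BITS

# Mathis et al.: throughput = 1.22 * MSS / (RTT * sqrt(L)); solve for L at the target rate.
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2

# Sanity check: substituting L back recovers the target rate.
thr = 1.22 * MSS_BITS / (RTT * sqrt(L))

print(round(W), L)   # ~83333 in-flight segments, loss rate ~2e-10
```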

Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse: what happens when TCP is "too friendly"? E.g., a test on an Ethernet-based storage cluster:
- the client performs synchronized reads
- the # of servers involved in the transfer is increased (SRU size is fixed)
- TCP is used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a client connected through a switch to storage servers 1-4. A data block is striped across the servers; each server holds one Server Request Unit (SRU). In a synchronized read, the client requests the block from all servers in parallel, and only after all SRUs (1, 2, 3, 4) arrive does the client send the next batch of requests.]

Link idle time due to timeouts

[Figure: the same synchronized read, but server 4's SRU is lost. Servers 1-3 finish quickly, and the link is idle until server 4 experiences a timeout and retransmits.]

TCP Throughput Collapse: Incast

[Nagle04] called this Incast. The cause of the throughput collapse: TCP timeouts. [Figure: goodput collapses as the number of servers grows.]

TCP data-driven loss recovery

[Figure: the sender transmits segments 1-5; segment 2 is lost. Segments 3, 4, 5 each trigger a duplicate Ack 1. On 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately, and the receiver answers Ack 5. In SANs, recovery takes usecs after a loss.]

TCP timeout-driven loss recovery

[Figure: the sender transmits segments 1-5; losses leave it without enough duplicate ACKs, so it must wait for the Retransmission Timeout (RTO) before retransmitting; only then does an Ack arrive. Timeouts are expensive: msecs to recover after a loss.]

TCP loss recovery comparison

[Figure: side by side, data-driven recovery (duplicate ACKs trigger an immediate retransmit of 2, answered by Ack 5) vs. timeout-driven recovery (the sender idles for a full RTO). Timeout-driven recovery is slow (ms); data-driven recovery is super fast (us) in SANs.]

TCP Throughput Collapse: Summary

- Synchronized reads and TCP timeouts cause TCP throughput collapse.
- Previously tried options: increase the buffer size (costly); reduce RTOmin (unsafe); use Ethernet flow control (limited applicability).
- DCTCP (Data Center TCP): limit the in-network buffer occupancy (queue length) via both in-network signaling and end-to-end TCP modifications.

Perspective

Principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control. Instantiation and implementation in the Internet: UDP, TCP.

Next time: the Network Layer: leaving the network "edge" (application and transport layers) and heading into the network "core".

Before Next Time

- Project Proposal: due in one week; meet with groups, TA, and professor.
- Lab 1: single-threaded TCP proxy; due in one week, next Friday.
- No required reading and review due, but review chapter 4 from the book (Network Layer); we will also briefly discuss data center topologies.
- Check the website for the updated schedule.

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer Services/Protocols
  • Transport Layer Services/Protocols (2)
  • Transport Layer Services/Protocols (3)
  • Transport Layer Services/Protocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over "long fat pipes"
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Principles of Reliable Transport

Characteristics of the unreliable channel will determine the complexity of a reliable data transfer protocol (rdt). Reliable transfer is important in the application, transport, and link layers; it is on the top-10 list of important networking topics!

Principles of Reliable Transport

Interfaces of a reliable data transfer protocol:
- rdt_send(): called from above (e.g., by the app); passed data to deliver to the receiver's upper layer (send side)
- udt_send(): called by rdt to transfer a packet over the unreliable channel to the receiver (send side)
- rdt_rcv(): called when a packet arrives on the receive side of the channel
- deliver_data(): called by rdt to deliver data to the upper layer (receive side)

TCP Reliable Transport

TCP, the Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581):
- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set the window size
- full duplex data: bi-directional data flow in the same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender and receiver state before data exchange
- flow controlled: the sender will not overwhelm the receiver

TCP Reliable Transport: TCP Segment Structure

[Figure: TCP segment, 32 bits wide]
- source port #, dest port #
- sequence number; acknowledgement number (counting by bytes of data, not segments)
- head len, unused bits, flag bits U|A|P|R|S|F; receive window (# bytes the receiver is willing to accept)
- checksum (Internet checksum, as in UDP); urg data pointer
- options (variable length)
- application data (variable length)

Flag bits: URG: urgent data (generally not used); ACK: ACK # valid; PSH: push data now (generally not used); RST, SYN, FIN: connection establishment (setup and teardown commands).
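The 20-byte fixed header can be packed directly from the fields in the figure. A sketch only (checksum left at zero, no options), not a working TCP stack:

```python
import struct

def tcp_header(src_port, dst_port, seq, ack, flags, rwnd):
    """Pack a minimal 20-byte TCP header: ports, seq/ack numbers,
    head len (5 32-bit words) combined with the flag bits, receive
    window, checksum (left 0 here), and urgent pointer."""
    offset_and_flags = (5 << 12) | flags
    return struct.pack("!HHIIHHHH", src_port, dst_port, seq, ack,
                       offset_and_flags, rwnd, 0, 0)

ACK = 0x010                                 # the A flag bit
hdr = tcp_header(12345, 80, seq=42, ack=79, flags=ACK, rwnd=4096)
```

The `!` prefix gives network (big-endian) byte order, matching how the fields appear on the wire.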

TCP Reliable Transport: TCP Sequence Numbers and ACKs

Sequence numbers: the byte-stream "number" of the first byte in the segment's data. Acknowledgements: the seq # of the next byte expected from the other side; ACKs are cumulative. Q: how does a receiver handle out-of-order segments? A: the TCP spec doesn't say; it is up to the implementor.

[Figure: sender sequence number space, divided into: sent and ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable; the window of size N spans the middle two regions. Outgoing and incoming segments carry the sequence number, acknowledgement number, and rwnd fields shown in the segment structure.]

TCP Reliable Transport: TCP Sequence Numbers and ACKs (continued)

[Figure: simple telnet scenario. The user types 'C'; Host A sends Seq=42, ACK=79, data='C'. Host B ACKs receipt of 'C' and echoes back 'C': Seq=79, ACK=43, data='C'. Host A ACKs receipt of the echoed 'C': Seq=43, ACK=80.]
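The ACK arithmetic in this scenario is just "number of the next byte expected"; a one-function sketch:

```python
def next_ack(seq, data):
    """Cumulative ACK: the number of the next byte expected from the other side."""
    return seq + len(data)

# Host A: Seq=42, data='C'  ->  Host B ACKs 43 (and echoes with Seq=79)
# Host B: Seq=79, data='C'  ->  Host A ACKs 80
assert next_ack(42, b"C") == 43
assert next_ack(79, b"C") == 80
```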

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

Before exchanging data, sender and receiver "handshake":
- agree to establish a connection (each knowing the other is willing to establish the connection)
- agree on connection parameters

Each side then holds connection state ESTAB and connection variables: seq #s (client-to-server and server-to-client) and the rcvBuffer size at the server and the client.

Socket clientSocket = new Socket("hostname", "port number");   // client
Socket connectionSocket = welcomeSocket.accept();              // server

TCP Reliable Transport: Connection Management (TCP 3-way handshake, message exchange)

[Figure: 3-way handshake]
- Client (state LISTEN): chooses init seq num x, sends a TCP SYN msg (SYNbit=1, Seq=x); enters SYNSENT.
- Server (state LISTEN): chooses init seq num y, sends a TCP SYNACK msg acking the SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enters SYN RCVD.
- Client: the received SYNACK(x) indicates the server is live; sends an ACK for the SYNACK (ACKbit=1, ACKnum=y+1), which may contain client-to-server data; enters ESTAB.
- Server: the received ACK(y) indicates the client is live; enters ESTAB.

[FSM: 3-way handshake states]
- Client: closed, then (Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)) to SYNsent; on receiving SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1) and enter ESTAB.
- Server: closed, then (Socket connectionSocket = welcomeSocket.accept()) to listen; on receiving SYN(x), send SYNACK(seq=y, ACKnum=x+1) and create a new socket for communication back to the client, entering SYNrcvd; on receiving ACK(ACKnum=y+1), enter ESTAB.
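In the sockets API these transitions happen inside connect() and accept(): by the time both calls return, both sides are in ESTAB. A minimal loopback demonstration (the OS picks a free port):

```python
import socket
import threading

# accept()/connect() are where the SYN, SYN-ACK, ACK exchange happens.
welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))        # port 0: let the OS choose a free port
welcome.listen(1)
port = welcome.getsockname()[1]

def server():
    conn, _ = welcome.accept()        # completes the handshake; new socket per client
    conn.sendall(b"hello")
    conn.close()

t = threading.Thread(target=server)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))   # SYN -> SYNACK -> ACK
data = client.recv(1024)
client.close()
t.join()
welcome.close()
```

Note that accept() returns a *new* socket for the client, mirroring the "create new socket for communication back to client" step in the FSM.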

TCP Reliable Transport: Connection Management (closing a connection)

Client and server each close their side of the connection by sending a TCP segment with the FIN bit = 1, and respond to a received FIN with an ACK; on receiving a FIN, the ACK can be combined with the host's own FIN. Simultaneous FIN exchanges can be handled.

[Figure: connection teardown; client and server both start in ESTAB]
- Client: clientSocket.close(); sends FINbit=1, seq=x; can no longer send but can still receive data; enters FIN_WAIT_1.
- Server: ACKs (ACKbit=1, ACKnum=x+1); can still send data; enters CLOSE_WAIT.
- Client: enters FIN_WAIT_2, waiting for the server to close.
- Server: sends FINbit=1, seq=y; can no longer send data; enters LAST_ACK.
- Client: ACKs (ACKbit=1, ACKnum=y+1); enters TIMED_WAIT (a timed wait of 2 * max segment lifetime), then CLOSED.
- Server: on receiving the ACK, enters CLOSED.

TCP Reliable Transport: sender events

- data rcvd from app: create a segment with seq # (the seq # is the byte-stream number of the first data byte in the segment); start the timer if it is not already running (think of the timer as being for the oldest unACKed segment); expiration interval: TimeOutInterval.
- timeout: retransmit the segment that caused the timeout; restart the timer.
- ack rcvd: if the ack acknowledges previously unACKed segments, update what is known to be ACKed; restart the timer if there are still unACKed segments.
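These three events drive a small amount of sender state (SendBase, NextSeqNum, and a single retransmission timer). A toy model with hypothetical names, not the real protocol machinery:

```python
class SenderState:
    """Toy bookkeeping for the three TCP sender events."""
    def __init__(self):
        self.send_base = 0        # oldest unACKed byte
        self.next_seq = 0         # next byte-stream number to use
        self.timer_running = False

    def send(self, nbytes):
        """Event: data received from app; segment gets seq # of its first byte."""
        seg = (self.next_seq, nbytes)
        self.next_seq += nbytes
        if not self.timer_running:
            self.timer_running = True   # timer covers the oldest unACKed segment
        return seg

    def ack(self, acknum):
        """Event: ACK received (cumulative)."""
        if acknum > self.send_base:     # acknowledges previously unACKed data
            self.send_base = acknum
            # keep the timer only while something is still in flight
            self.timer_running = self.send_base < self.next_seq

s = SenderState()
s.send(8)       # segment (0, 8)
s.send(20)      # segment (8, 20)
s.ack(8)        # first segment ACKed; 20 bytes still unACKed
```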

TCP Reliable Transport: TCP retransmission scenarios

[Figure: lost ACK scenario. Host A sends Seq=92, 8 bytes of data; the ACK=100 is lost; A times out and retransmits Seq=92, 8 bytes of data; Host B re-ACKs with ACK=100; SendBase=100.]

[Figure: premature timeout. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the timer fires before ACK=100 arrives, so A retransmits Seq=92, 8 bytes; Host B, having received everything, answers with cumulative ACK=120; SendBase advances 92, then 100, then 120.]

[Figure: cumulative ACK scenario. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but ACK=120 arrives before the timeout; the cumulative ACK=120 covers the earlier segment, so nothing is retransmitted and A proceeds with Seq=120, 15 bytes of data.]

TCP ACK generation [RFC 1122, 2581]: event at receiver, and the TCP receiver action

- arrival of in-order segment with expected seq #; all data up to the expected seq # already ACKed: delayed ACK. Wait up to 500 ms for the next segment; if no next segment, send an ACK.
- arrival of in-order segment with expected seq #; one other segment has an ACK pending: immediately send a single cumulative ACK, ACKing both in-order segments.
- arrival of out-of-order segment with higher-than-expected seq # (gap detected): immediately send a duplicate ACK, indicating the seq # of the next expected byte.
- arrival of a segment that partially or completely fills the gap: immediately send an ACK, provided the segment starts at the lower end of the gap.

TCP Reliable Transport: TCP Fast Retransmit

The timeout period is often relatively long, causing a long delay before a lost packet is resent. Instead, detect lost segments via duplicate ACKs: the sender often sends many segments back-to-back, so if a segment is lost, there will likely be many duplicate ACKs.

TCP fast retransmit: if the sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend the unACKed segment with the smallest seq #. It is likely that the unACKed segment was lost, so don't wait for the timeout.

[Figure: fast retransmit after sender receipt of a triple duplicate ACK. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the Seq=100 segment is lost; subsequent segments each produce ACK=100; after the third duplicate ACK=100, A resends Seq=100, 20 bytes of data, well before the timeout.]
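Counting duplicate ACKs is all the trigger needs; a small sketch of the triple-duplicate rule:

```python
def is_triple_duplicate(ack_history, ack):
    """Record an incoming ACK; True once the same ACK number has been
    seen four times (the original plus three duplicates)."""
    ack_history.append(ack)
    return ack_history.count(ack) == 4

history = []
# ACK=100 arrives once, then three more times as later segments hit the gap
flags = [is_triple_duplicate(history, a) for a in [100, 100, 100, 100]]
```

Only the third duplicate (the fourth ACK=100 overall) fires the fast retransmit.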

TCP Reliable Transport: TCP round-trip time and timeouts

Q: how to set the TCP timeout value? It must be longer than the RTT, but the RTT varies: too short causes premature timeouts and unnecessary retransmissions; too long causes a slow reaction to segment loss.

Q: how to estimate the RTT? SampleRTT: the measured time from segment transmission until ACK receipt (ignoring retransmissions). SampleRTT will vary, so we want the estimated RTT to be "smoother": average several recent measurements, not just the current SampleRTT.

EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT

This is an exponential weighted moving average: the influence of a past sample decreases exponentially fast. Typical value: α = 0.125.

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; the jagged SampleRTT and the smoothed EstimatedRTT, RTT in milliseconds vs. time in seconds.]

The timeout interval is EstimatedRTT plus a "safety margin": a large variation in EstimatedRTT calls for a larger safety margin. Estimate the deviation of SampleRTT from EstimatedRTT:

DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|   (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 * DevRTT   (estimated RTT, plus the safety margin)
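The two formulas compose into a small estimator; a sketch using the slide's typical α and β (the starting values and sample sequence are hypothetical):

```python
ALPHA, BETA = 0.125, 0.25

def update(est_rtt, dev_rtt, sample_rtt):
    """EWMA of the RTT, EWMA of its deviation, and the resulting timeout."""
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    return est_rtt, dev_rtt, est_rtt + 4 * dev_rtt

est, dev = 0.100, 0.0                       # seconds
for s in [0.110, 0.090, 0.250, 0.105]:      # the 250 ms spike inflates the margin
    est, dev, timeout = update(est, dev, s)
```

Note how the spike barely moves EstimatedRTT but roughly doubles the timeout; that is exactly the point of the 4 * DevRTT safety margin.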

TCP Reliable Transport: Flow Control

[Figure: receiver protocol stack. IP code delivers to TCP code, which fills the TCP socket receiver buffers; the application process reads from them.]

The application may remove data from the TCP socket buffers slower than the TCP receiver is delivering it (i.e., than the sender is sending it). Flow control: the receiver controls the sender, so the sender won't overflow the receiver's buffer by transmitting too much, too fast.

Receiver-side buffering: the receiver "advertises" its free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments. The RcvBuffer size is set via socket options (a typical default is 4096 bytes), and many operating systems autoadjust RcvBuffer. The sender limits the amount of unACKed ("in-flight") data to the receiver's rwnd value, which guarantees the receive buffer will not overflow.

[Figure: RcvBuffer split into buffered data and free buffer space (rwnd); TCP segment payloads arrive from the sender while buffered data drains to the application process.]
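The sender-side check is a subtraction; a sketch of the rwnd limit:

```python
def usable_window(last_byte_sent, last_byte_acked, rwnd):
    """Bytes the sender may still put in flight without overflowing
    the receiver's advertised buffer."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

assert usable_window(5000, 4000, rwnd=4096) == 3096   # room for 3096 more bytes
assert usable_window(8096, 4000, rwnd=4096) == 0      # buffer full: sender must wait
```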

Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem

Principles of Congestion Control

Congestion: informally, "too many sources sending too much data too fast for the network to handle". It is different from flow control. Manifestations: lost packets (buffer overflow at routers) and long delays (queueing in router buffers).

Two broad approaches towards congestion control:
- end-end congestion control: no explicit feedback from the network; congestion is inferred from end-system observed loss and delay; this is the approach taken by TCP.
- network-assisted congestion control: routers provide feedback to end systems, either a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM) or an explicit rate for the sender to send at.

TCP Congestion Control: TCP Fairness

Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K. [Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R.]

TCP Congestion Control: AIMD (additive increase, multiplicative decrease)

Approach: the sender increases its transmission rate (window size), probing for usable bandwidth, until loss occurs:
- additive increase: increase cwnd by 1 MSS every RTT until loss is detected
- multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD "saw tooth" behavior. The TCP sender congestion window size additively increases until loss occurs, is then cut in half, and the cycle repeats, probing for bandwidth.]

TCP Congestion Control

The sender limits transmission: LastByteSent - LastByteAcked <= cwnd, where cwnd is a dynamic function of perceived network congestion. TCP sending rate, roughly: send cwnd bytes, wait one RTT for the ACKs, then send more bytes:

rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space; the "in-flight" bytes, from the last byte ACKed to the last byte sent, fit within cwnd.]

TCP Congestion Control: TCP Fairness (Why is TCP fair?)

Two competing sessions: additive increase gives a slope of 1 as throughput increases; multiplicative decrease decreases throughput proportionally.

[Figure: Connection 1 throughput (x-axis) vs. Connection 2 throughput (y-axis), both bounded by R. Starting anywhere, repeated congestion-avoidance additive increase (moving along slope 1) and loss-triggered halving (moving halfway toward the origin) bring the operating point toward the equal bandwidth share line.]
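The geometric argument can be simulated directly. A toy model with synchronized loss (both flows halve together), showing convergence to the equal share:

```python
def aimd_step(x1, x2, capacity):
    """One synchronized round: additive increase by 1 for each flow;
    multiplicative decrease (halve both) when the bottleneck is exceeded."""
    x1, x2 = x1 + 1, x2 + 1
    if x1 + x2 > capacity:
        x1, x2 = x1 / 2, x2 / 2
    return x1, x2

x1, x2 = 50.0, 10.0            # start far from the fair share
for _ in range(200):
    x1, x2 = aimd_step(x1, x2, capacity=100)
```

Additive increase preserves the gap between the two flows, while each multiplicative decrease halves it, so the operating point walks toward the x1 = x2 line regardless of where it starts.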

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 15: Transport Layer and Data Center TCP

characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

important in application transport link layers top-10 list of important networking topics

Principles of Reliable Transport

sendside

receiveside

rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer

udt_send() called by rdtto transfer packet over unreliable channel to receiver

rdt_rcv() called when packet arrives on rcv-side of channel

deliver_data() called by rdt to deliver data to upper

Principles of Reliable Transport

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

Before exchanging data, sender and receiver "handshake":
- agree to establish connection (each knowing the other is willing to establish the connection)
- agree on connection parameters

Each side then holds connection state (ESTAB) and connection variables: seq #s (client-to-server, server-to-client), rcvBuffer size at server/client.

Client: Socket clientSocket = new Socket(hostname, port);
Server: Socket connectionSocket = welcomeSocket.accept();

Message exchange (client: LISTEN → SYNSENT → ESTAB; server: LISTEN → SYN RCVD → ESTAB):
1. client chooses init seq num x, sends TCP SYN msg: SYNbit=1, Seq=x
2. server chooses init seq num y, sends TCP SYNACK msg acking the SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. client receives SYNACK(x) — indicates server is live; sends ACK for SYNACK: ACKbit=1, ACKnum=y+1 (this segment may contain client-to-server data)
4. server receives ACK(y) — indicates client is live

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

State transitions:
- client: closed → [Socket clientSocket = new Socket(hostname, port); send SYN(seq=x)] → SYNsent → [receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)] → ESTAB
- server: closed → [Socket connectionSocket = welcomeSocket.accept()] → listen → [receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client] → SYNrcvd → [receive ACK(ACKnum=y+1)] → ESTAB

TCP Reliable Transport: Connection Management (TCP 3-way handshake)
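The Socket calls on the slides trigger exactly this exchange; the kernel performs SYN / SYNACK / ACK underneath. A minimal loopback sketch (our own demo class, not from the slides — the port is picked automatically):

```java
import java.io.*;
import java.net.*;

// Sketch: the accept() / new Socket() pair from the slide, on loopback.
// By the time either call returns, the 3-way handshake has completed.
public class HandshakeDemo {
    static String connectOnce() throws Exception {
        ServerSocket welcomeSocket = new ServerSocket(0);  // any free port
        Thread server = new Thread(() -> {
            try (Socket connectionSocket = welcomeSocket.accept()) {
                // connection is ESTAB here: handshake already done
                connectionSocket.getOutputStream().write("hi\n".getBytes());
            } catch (IOException e) { throw new UncheckedIOException(e); }
        });
        server.start();
        String line;
        try (Socket clientSocket =
                 new Socket("127.0.0.1", welcomeSocket.getLocalPort())) {
            // returning from the constructor means our ACK of the SYNACK was sent
            line = new BufferedReader(
                new InputStreamReader(clientSocket.getInputStream())).readLine();
        }
        server.join();
        welcomeSocket.close();
        return line;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(connectOnce());
    }
}
```

Closing the sockets likewise triggers the FIN/ACK teardown described on the next slide.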

TCP Reliable Transport: Connection Management (closing a connection)

- client, server each close their side of the connection: send TCP segment with FIN bit = 1
- respond to a received FIN with an ACK; on receiving a FIN, the ACK can be combined with own FIN
- simultaneous FIN exchanges can be handled

Closing sequence (client and server both start in ESTAB):
1. clientSocket.close(): client sends FINbit=1, seq=x (can no longer send, but can receive data) → FIN_WAIT_1
2. server ACKs: ACKbit=1, ACKnum=x+1 → CLOSE_WAIT (can still send data); client → FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y (can no longer send data) → LAST_ACK
4. client ACKs: ACKbit=1, ACKnum=y+1 → TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server → CLOSED on receiving the ACK

TCP Reliable Transport: Connection Management (closing a connection)


TCP Reliable Transport (sender events)

- data rcvd from app: create segment with seq # (seq # is byte-stream number of first data byte in segment); start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
- timeout: retransmit segment that caused timeout; restart timer
- ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport: TCP Retransmission Scenarios

- lost ACK scenario: Host A sends Seq=92, 8 bytes of data; B's ACK=100 is lost; on timeout A retransmits Seq=92, 8 bytes of data; B re-sends ACK=100 (SendBase=100)
- premature timeout: A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); B's ACK=100 and ACK=120 arrive only after A's timeout, so A retransmits Seq=92, 8 bytes; B replies ACK=120; SendBase advances 92 → 100 → 120

- cumulative ACK: A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but ACK=120 arrives before the timeout — the cumulative ACK covers both segments, so A continues with Seq=120, 15 bytes of data

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, RFC 2581]

- event: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  action: delayed ACK — wait up to 500 ms for next segment; if no next segment, send ACK
- event: arrival of in-order segment with expected seq #; one other segment has ACK pending
  action: immediately send single cumulative ACK, ACKing both in-order segments
- event: arrival of out-of-order segment with higher-than-expected seq # (gap detected)
  action: immediately send duplicate ACK, indicating seq # of next expected byte
- event: arrival of segment that partially or completely fills gap
  action: immediately send ACK, provided that segment starts at lower end of gap

TCP Reliable Transport: TCP Fast Retransmit

- time-out period often relatively long: long delay before resending lost packet
- detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
- TCP fast retransmit: if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq # — likely that the unacked segment was lost, so don't wait for timeout

Fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92 (8 bytes) then Seq=100 (20 bytes); the Seq=100 segment is lost; Host B returns ACK=100 four times; on the third duplicate ACK, A retransmits Seq=100, 20 bytes of data — well before the timeout expires.
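The sender-side rule above reduces to a small state machine: remember the last cumulative ACK, count repeats, fire on the third duplicate. A sketch (class and field names are ours):

```java
import java.util.*;

// Sketch of fast retransmit: count duplicate ACKs, retransmit on the 3rd.
public class FastRetransmit {
    int lastAck = -1, dupCount = 0;
    List<Integer> retransmitted = new ArrayList<>();

    void onAck(int ackNum) {
        if (ackNum == lastAck) {
            dupCount++;
            if (dupCount == 3) {
                // resend unacked segment with smallest seq #: it starts at ackNum
                retransmitted.add(ackNum);
            }
        } else {                      // new cumulative ACK resets the counter
            lastAck = ackNum;
            dupCount = 0;
        }
    }

    public static void main(String[] args) {
        FastRetransmit s = new FastRetransmit();
        // Host B keeps re-ACKing 100 because the Seq=100 segment was lost
        for (int ack : new int[] {100, 100, 100, 100}) s.onAck(ack);
        System.out.println("fast retransmit of seq " + s.retransmitted.get(0));
    }
}
```

The first ACK=100 is the original; the next three are the duplicates that trigger the retransmission.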

TCP Reliable Transport: TCP Round-trip Time and Timeouts

Q: how to set TCP timeout value?
- longer than RTT — but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss

Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
- SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT
- exponential weighted moving average: influence of past sample decreases exponentially fast
- typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT (RTT in milliseconds vs. time in seconds) for gaia.cs.umass.edu to fantasia.eurecom.fr]

- timeout interval: EstimatedRTT plus "safety margin" (large variation in EstimatedRTT -> larger safety margin)
- estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|   (typically, β = 0.25)
- TimeoutInterval = EstimatedRTT + 4 · DevRTT   (estimated RTT + "safety margin")

TCP Reliable Transport: TCP Round-trip Time and Timeouts
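The two EWMAs above translate directly into code. A sketch with the slide's typical values α = 0.125 and β = 0.25 (class and variable names are ours; initializing DevRTT to half the first sample follows common practice, not the slides):

```java
// Sketch of the slide's RTT estimator and timeout computation.
public class RttEstimator {
    double estimatedRtt, devRtt;

    RttEstimator(double firstSampleMs) {
        estimatedRtt = firstSampleMs;
        devRtt = firstSampleMs / 2;       // assumed initialization
    }

    void onSample(double sampleRtt) {      // ignore retransmissions!
        double alpha = 0.125, beta = 0.25;
        estimatedRtt = (1 - alpha) * estimatedRtt + alpha * sampleRtt;
        devRtt = (1 - beta) * devRtt + beta * Math.abs(sampleRtt - estimatedRtt);
    }

    double timeoutInterval() {             // EstimatedRTT + 4 * DevRTT
        return estimatedRtt + 4 * devRtt;
    }

    public static void main(String[] args) {
        RttEstimator e = new RttEstimator(100);
        for (double s : new double[] {120, 110, 300, 105}) e.onSample(s);
        System.out.printf("EstimatedRTT=%.1f ms, TimeoutInterval=%.1f ms%n",
                          e.estimatedRtt, e.timeoutInterval());
    }
}
```

Note how one 300 ms outlier inflates DevRTT, widening the safety margin far more than it moves the smoothed RTT.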


TCP Reliable Transport: Flow Control

- receiver protocol stack: the application process reads from TCP socket receiver buffers, filled by the TCP code (above the IP code) in the OS, from the sender
- the application may remove data from TCP socket buffers slower than the TCP receiver is delivering (i.e., slower than the sender is sending)
- flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

Receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive, the application process drains them.
- receiver "advertises" free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits amount of unacked ("in-flight") data to receiver's rwnd value
- guarantees receive buffer will not overflow
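The sender's side of this rule — keep in-flight bytes at or below the advertised rwnd — can be sketched as follows (our own class and names; real TCP also applies the congestion window, which this sketch omits):

```java
// Sketch: sender keeping unacked ("in-flight") bytes <= receiver's rwnd.
public class FlowControlledSender {
    long lastByteSent = 0, lastByteAcked = 0;
    int rwnd;                                   // latest advertised window
    FlowControlledSender(int rwnd) { this.rwnd = rwnd; }

    int trySend(int wanted) {                   // returns bytes actually sent
        long inFlight = lastByteSent - lastByteAcked;
        int allowed = (int) Math.max(0, rwnd - inFlight);
        int sent = Math.min(wanted, allowed);
        lastByteSent += sent;
        return sent;
    }

    void onAck(long ackNum, int newRwnd) {      // cumulative ACK + fresh rwnd
        lastByteAcked = Math.max(lastByteAcked, ackNum);
        rwnd = newRwnd;
    }

    public static void main(String[] args) {
        FlowControlledSender s = new FlowControlledSender(4096);
        System.out.println(s.trySend(3000));    // fits in the window
        System.out.println(s.trySend(3000));    // clipped: only 1096 bytes fit
        s.onAck(3000, 4096);                    // receiver drained some data
        System.out.println(s.trySend(3000));    // window reopens partially
    }
}
```

Because sending is clipped to rwnd minus in-flight data, the receive buffer can never overflow.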

Goals for Today
- Transport Layer: abstraction, services; multiplexing/demultiplexing; UDP: connectionless transport; TCP: reliable transport (abstraction, connection management, reliable transport, flow control, timeouts); congestion control
- Data Center TCP: Incast problem

Principles of Congestion Control

Congestion, informally: "too many sources sending too much data too fast for the network to handle"
- different from flow control!
- manifestations: lost packets (buffer overflow at routers); long delays (queueing in router buffers)

Two broad approaches towards congestion control:
- end-end congestion control: no explicit feedback from network; congestion inferred from end-system observed loss, delay; approach taken by TCP
- network-assisted congestion control: routers provide feedback to end systems — a single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM), or an explicit rate for the sender to send at

TCP Congestion Control: TCP Fairness

Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K (e.g., TCP connections 1 and 2 sharing a bottleneck router of capacity R).

AIMD: additive increase, multiplicative decrease
- approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss
- result: the AIMD "saw tooth" — TCP sender congestion window size additively increases over time until loss occurs, then is cut in half — probing for bandwidth

TCP Congestion Control: TCP Fairness — Why is TCP fair?
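The AIMD rule above — +1 MSS per RTT, halve on loss — produces the sawtooth directly. A sketch (the loss pattern here is made up purely to show the shape):

```java
// Sketch of AIMD, cwnd in units of MSS; losses at assumed RTTs 5 and 9.
public class Aimd {
    static int step(int cwnd, boolean loss) {
        return loss ? cwnd / 2      // multiplicative decrease
                    : cwnd + 1;     // additive increase (1 MSS per RTT)
    }

    public static void main(String[] args) {
        int cwnd = 10;
        StringBuilder trace = new StringBuilder();
        for (int rtt = 0; rtt < 12; rtt++) {
            cwnd = step(cwnd, rtt == 5 || rtt == 9);  // assumed loss events
            trace.append(cwnd).append(' ');
        }
        System.out.println(trace.toString().trim());
    }
}
```

The printed trace climbs by 1 each RTT and halves at each loss — the sawtooth from the slide. Halving is also why AIMD is fair: a flow with more bandwidth gives back more at each loss.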

TCP Congestion Control
- sender limits transmission: LastByteSent − LastByteAcked <= cwnd
- cwnd is a dynamic function of perceived network congestion
- sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent, within a window of cwnd bytes
- TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes; rate ≈ cwnd / RTT bytes/sec

TCP Congestion Control: TCP Fairness — Why is TCP fair?

Two competing sessions:
- additive increase gives slope of 1 as throughput increases
- multiplicative decrease decreases throughput proportionally
- plotting connection 1 throughput vs. connection 2 throughput (both bounded by R): repeated rounds of congestion avoidance (additive increase) and loss (decrease window by factor of 2) move the operating point toward the equal-bandwidth-share line

Fairness and UDP:
- multimedia apps often do not use TCP: do not want rate throttled by congestion control
- instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
- an application can open multiple parallel connections between two hosts; web browsers do this
- e.g., link of rate R with 9 existing connections: new app asking for 1 TCP gets rate R/10; new app asking for 11 TCPs gets R/2

TCP Congestion Control: TCP Fairness

TCP Congestion Control: Slow Start
- when connection begins, increase rate exponentially until first loss event: initially cwnd = 1 MSS; double cwnd every RTT; done by incrementing cwnd for every ACK received
- summary: initial rate is slow, but ramps up exponentially fast
- (Host A to Host B: one segment in the first RTT, then two segments, then four)

TCP Congestion Control: Detecting and Reacting to Loss
- loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to a threshold, then grows linearly
- loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation: variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event.

TCP Congestion Control (FSM: slow start / congestion avoidance / fast recovery)

- slow start — init: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
  - new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  - duplicate ACK: dupACKcount++
  - cwnd > ssthresh: enter congestion avoidance
  - timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
- congestion avoidance
  - new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  - duplicate ACK: dupACKcount++
  - dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
  - timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; re-enter slow start
- fast recovery
  - duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  - new ACK: cwnd = ssthresh; dupACKcount = 0; return to congestion avoidance
  - timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; re-enter slow start

TCP Throughput

- avg TCP throughput as function of window size, RTT (ignore slow start, assume always data to send)
- W: window size (measured in bytes) where loss occurs; the window oscillates between W/2 and W
  - avg window size (# in-flight bytes) is 3/4 · W
  - avg TCP throughput = (3/4) · W / RTT bytes/sec

TCP over "long, fat pipes"

- example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability L [Mathis 1997]:
  TCP throughput = 1.22 · MSS / (RTT · √L)
- to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10 — a very small loss rate!
- hence new versions of TCP for high-speed networks
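Plugging the slide's numbers into the Mathis formula checks the claim. A sketch (our own class; the formula itself is the one on the slide):

```java
// Throughput = 1.22 * MSS / (RTT * sqrt(L)), per [Mathis 1997].
public class MathisThroughput {
    static double throughputBps(double mssBytes, double rttSec, double lossRate) {
        return 1.22 * mssBytes * 8 / (rttSec * Math.sqrt(lossRate));
    }

    public static void main(String[] args) {
        double mss = 1500, rtt = 0.100;       // 1500-byte segments, 100 ms RTT
        double L = 2e-10;                     // the slide's required loss rate
        double gbps = throughputBps(mss, rtt, L) / 1e9;
        // with L = 2e-10 the formula lands close to the 10 Gbps target
        System.out.printf("%.1f Gbps%n", gbps);
    }
}
```

Rearranging for L shows why the requirement is so harsh: required loss rate scales with the square of 1/throughput.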


Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse: what happens when TCP is "too friendly"? E.g.:
- test on an Ethernet-based storage cluster
- client performs synchronized reads
- increase # of servers involved in transfer (SRU size is fixed)
- TCP used as the data transfer protocol

Cluster-based Storage Systems
- a client connects through a switch to storage servers 1–4
- a data block is striped across the servers; each server holds one Server Request Unit (SRU)
- synchronized read: client sends a request R to every server, then waits for all SRUs of the block before sending the next batch of requests

Link idle time due to timeouts
- synchronized read across servers 1–4: if server 4's portion of the block is lost, the link is idle until that server experiences a timeout

TCP Throughput Collapse: Incast
- [Nagle04] called this Incast: goodput collapses as the number of servers in the transfer increases
- cause of throughput collapse: TCP timeouts

TCP data-driven loss recovery
- sender transmits Seq 1–5; packet 2 is lost; the receiver returns Ack 1 for each later arrival
- 3 duplicate ACKs for 1 (packet 2 is probably lost) → retransmit packet 2 immediately; receiver then sends Ack 5
- in SANs, data-driven recovery completes in µsecs after loss

TCP timeout-driven loss recovery
- sender transmits Seq 1–5; packets 2–5 are lost, so only Ack 1 returns — no duplicate ACKs are generated
- the sender must wait for the Retransmission Timeout (RTO) before resending
- timeouts are expensive (msecs to recover after loss)

TCP loss recovery comparison
- data-driven recovery (triple duplicate ACK → retransmit 2 → Ack 5) is super fast (µs) in SANs
- timeout-driven recovery (wait out the RTO, then retransmit) is slow (ms)

TCP Throughput Collapse: Summary

- synchronized reads and TCP timeouts cause TCP throughput collapse
- previously tried options: increase buffer size (costly); reduce RTOmin (unsafe); use Ethernet flow control (limited applicability)
- DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective
- principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
- instantiation, implementation in the Internet: UDP, TCP
- next time: Network Layer — leaving the network "edge" (application, transport layers), into the network "core"

Before Next Time
- Project Proposal: due in one week; meet with groups, TA, and professor
- Lab 1: single-threaded TCP proxy; due in one week, next Friday
- no required reading and review due, but review chapter 4 from the book (Network Layer); we will also briefly discuss data center topologies
- check website for updated schedule

Page 16: Transport Layer and Data Center TCP

sendside

receiveside

rdt_send() called from above (eg by app) Passed data to deliver to receiver upper layer

udt_send() called by rdtto transfer packet over unreliable channel to receiver

rdt_rcv() called when packet arrives on rcv-side of channel

deliver_data() called by rdt to deliver data to upper

Principles of Reliable Transport

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing

to establish connection)bull agree on connection parameters

connection state ESTABconnection variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

connection state ESTABconnection Variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

Socket clientSocket = newSocket(hostnameport

number)

Socket connectionSocket = welcomeSocketaccept()

Connection Management TCP 3-way handshake

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

[figure: Connection 1 throughput vs. Connection 2 throughput, both axes from 0 to R, with the "equal bandwidth share" diagonal; repeated cycles of congestion avoidance (additive increase) and loss (decrease window by factor of 2) converge toward the fair share]

TCP Congestion Control: TCP Fairness. Why is TCP Fair?

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
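The parallel-connection arithmetic above can be checked directly. A small sketch (R normalized to 1.0; this just evaluates the R/K fair-share rule, nothing more):

```python
def fair_share(R, k):
    """TCP fairness goal: each of k connections on a bottleneck of rate R gets R/k."""
    return R / k

# link of rate R with 9 existing connections:
one_more = fair_share(1.0, 10)            # new app opens 1 connection  -> R/10
eleven_more = 11 * fair_share(1.0, 20)    # new app opens 11 connections -> 11R/20, roughly R/2
```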

TCP Congestion Control: TCP Fairness

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[figure: Host A sends one segment, then two segments, then four segments to Host B, one RTT apart]

TCP Congestion Control: Slow Start
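The doubling-per-RTT behavior of slow start can be sketched in a few lines (cwnd in MSS units; a toy model that ignores ssthresh and loss):

```python
def slow_start(mss, rtts):
    """cwnd trace during slow start: 1, 2, 4, 8, ... segments per RTT.
    Each ACK adds one MSS, which doubles cwnd once per RTT."""
    cwnd = 1 * mss
    trace = [cwnd]
    for _ in range(rtts):
        cwnd *= 2
        trace.append(cwnd)
    return trace

trace = slow_start(1, 3)   # exponential ramp-up over 3 RTTs
```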

TCP Congestion Control

loss indicated by timeout:
• cwnd set to 1 MSS
• window then grows exponentially (as in slow start) to threshold, then grows linearly

loss indicated by 3 duplicate ACKs (TCP RENO):
• dup ACKs indicate network capable of delivering some segments
• cwnd is cut in half, window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Detecting and Reacting to Loss

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

[FSM: TCP congestion control states: slow start, congestion avoidance, fast recovery]

initialization (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start

TCP Congestion Control
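The state machine above can be turned into a small simulation. This is a sketch of the slide's FSM, not a full TCP: cwnd and ssthresh are in MSS units, and no segments are actually transmitted.

```python
class TcpCongestion:
    """Slow-start / congestion-avoidance / fast-recovery FSM (cwnd in MSS units)."""

    def __init__(self, ssthresh=8):
        self.state = "slow_start"
        self.cwnd, self.ssthresh, self.dup = 1.0, float(ssthresh), 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += 1                      # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += 1 / self.cwnd          # ~+1 MSS per RTT
        else:                                   # fast recovery ends
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
        self.dup = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1                      # inflate window per dup ACK
            return
        self.dup += 1
        if self.dup == 3:                       # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd, self.dup, self.state = 1.0, 0, "slow_start"
```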

• avg TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT

  avg TCP throughput = (3/4) · W / RTT  bytes/sec

[figure: sawtooth oscillating between W/2 and W]

TCP Throughput

TCP over "long fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = (1.22 · MSS) / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP for high-speed
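The Mathis formula can be evaluated for the slide's example numbers. A quick check (MSS = 1500 bytes, RTT = 100 ms, target 10 Gbps):

```python
import math

MSS_BYTES, RTT_S = 1500, 0.100   # example values from the slide

def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    # throughput ≈ 1.22 * MSS / (RTT * sqrt(L))   [Mathis 1997]
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))

def loss_for_target_bps(mss_bytes, rtt_s, target_bps):
    # invert the formula: loss rate that sustains target_bps
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

L = loss_for_target_bps(MSS_BYTES, RTT_S, 10e9)   # ~2e-10, as on the slide
```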

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[figure: synchronized read: a client, connected through a switch to storage servers 1-4, sends a request (R) to each server; a data block is striped into Server Request Units (SRUs) 1-4, one per server; once SRUs 1, 2, 3, 4 all arrive, the client sends the next batch of requests]

Link idle time due to timeouts

[figure: same synchronized read, but server 4's SRU is lost; the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[figure: goodput collapses as the number of servers increases]

TCP data-driven loss recovery

[figure: sender transmits segments 1-5; segment 2 is lost; the receiver ACKs 1 for segments 3, 4, 5, giving the sender 3 duplicate ACKs for 1 (packet 2 is probably lost); the sender retransmits packet 2 immediately; the receiver then sends Ack 5]

In SANs: recovery in µsecs after loss

TCP timeout-driven loss recovery

[figure: sender transmits segments 1-5; losses leave too few duplicate ACKs (only Ack 1 returns), so the sender must wait for a full Retransmission Timeout (RTO) before retransmitting]

• Timeouts are expensive (msecs to recover after loss)

TCP: Loss recovery comparison

[figure, side by side: data-driven recovery (segment 2 lost; three duplicate ACKs for 1 trigger an immediate retransmit of 2, then Ack 5) vs. timeout-driven recovery (the sender idles for a full Retransmission Timeout (RTO) before retransmitting)]

• Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs

TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective

• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 17: Transport Layer and Data Center TCP

TCP: Transmission Control Protocol. RFCs 793, 1122, 1323, 2018, 2581

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP Reliable Transport

TCP Reliable Transport: TCP Segment Structure

[figure: TCP segment format, 32 bits wide]
  source port # | dest port #
  sequence number                (counting by bytes of data, not segments!)
  acknowledgement number
  head len | not used | flag bits U A P R S F | receive window   (# bytes rcvr willing to accept)
  checksum (Internet checksum, as in UDP) | urgent data pointer
  options (variable length)
  application data (variable length)

flag bits:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now (generally not used)
• RST, SYN, FIN: connection estab (setup, teardown commands)
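The checksum field above is the standard Internet checksum (the same one UDP uses). A minimal sketch of the RFC 1071 computation: one's-complement sum of 16-bit words, carries folded back in, then complemented.

```python
def internet_checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071 sketch): 16-bit one's-complement
    sum over 16-bit words, complemented. Odd-length data is zero-padded."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in
    return (~total) & 0xFFFF
```

Verification works the same way at the receiver: summing the data together with its checksum yields 0.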

sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how does the receiver handle out-of-order segments?
  – A: TCP spec doesn't say; up to implementor

[figure: incoming segment to sender and outgoing segment from sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer]

[figure: sender sequence number space, window size N]
  sent & ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable

TCP Reliable Transport: TCP Sequence numbers and Acks

[figure: simple telnet scenario]
  User types 'C':  Host A sends Seq=42, ACK=79, data = 'C'
  Host B ACKs receipt of 'C', echoes back 'C':  Seq=79, ACK=43, data = 'C'
  Host A ACKs receipt of echoed 'C':  Seq=43, ACK=80

TCP Reliable Transport: TCP Sequence numbers and Acks


TCP Reliable Transport

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[figure: client and server applications, each ending in connection state ESTAB with connection variables: seq # (client-to-server, server-to-client), rcvBuffer size at server, client]

  Socket clientSocket = new Socket("hostname", "port number");
  Socket connectionSocket = welcomeSocket.accept();

TCP Reliable Transport: Connection Management: TCP 3-way handshake

[figure: TCP 3-way handshake message flow; client state LISTEN → SYNSENT → ESTAB; server state LISTEN → SYN RCVD → ESTAB]

1. client: choose init seq num x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
2. server: choose init seq num y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
4. server: received ACK(y) indicates client is live; enter ESTAB

TCP Reliable Transport: Connection Management: TCP 3-way handshake
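The deck's snippets use Java Sockets; the same handshake can be exercised from Python over loopback. The kernel performs the SYN / SYNACK / ACK exchange underneath `connect()` and `accept()`; this sketch just demonstrates that, with a trivial echo-uppercase server on an ephemeral port:

```python
import socket
import threading

def serve_once(welcome):
    conn, _ = welcome.accept()          # returns once the 3-way handshake completes
    data = conn.recv(1024)
    conn.sendall(data.upper())
    conn.close()

welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
welcome.bind(("127.0.0.1", 0))          # port 0: let the OS pick an ephemeral port
welcome.listen(1)
port = welcome.getsockname()[1]

t = threading.Thread(target=serve_once, args=(welcome,))
t.start()

client = socket.create_connection(("127.0.0.1", port))  # SYN / SYNACK / ACK happen here
client.sendall(b"c")
reply = client.recv(1024)
client.close()
t.join()
welcome.close()
```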

[FSM: connection establishment]
client side:
  closed → (Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)) → SYNsent
  SYNsent → (rcv SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)) → ESTAB
server side:
  closed → (Socket connectionSocket = welcomeSocket.accept()) → listen
  listen → (rcv SYN(x); create new socket for communication back to client; send SYNACK(seq=y, ACKnum=x+1)) → SYNrcvd
  SYNrcvd → (rcv ACK(ACKnum=y+1)) → ESTAB

TCP Reliable Transport: Connection Management: TCP 3-way handshake

client, server each close their side of connection:
• send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management: Closing connection

[figure: closing a connection; client and server start in ESTAB]
  client: clientSocket.close(); send FINbit=1, seq=x → FIN_WAIT_1 (can no longer send, but can receive data)
  server: rcv FIN; send ACKbit=1, ACKnum=x+1 → CLOSE_WAIT (can still send data)
  client: rcv ACK → FIN_WAIT_2 (wait for server close)
  server: send FINbit=1, seq=y → LAST_ACK (can no longer send data)
  client: rcv FIN; send ACKbit=1, ACKnum=y+1 → TIMED_WAIT (timed wait for 2 × max segment lifetime) → CLOSED
  server: rcv ACK → CLOSED

TCP Reliable Transport: Connection Management: Closing connection


TCP Reliable Transport

data rcvd from app:
• create segment with seq #
  – seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP Reliable Transport

[figure: lost ACK scenario: Host A sends Seq=92, 8 bytes of data; the ACK=100 is lost; after a timeout, Host A retransmits Seq=92, 8 bytes; Host B ACKs 100 again; SendBase moves from 92 to 100]

[figure: premature timeout: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the timer for Seq=92 fires before ACK=100 arrives, so Seq=92 is retransmitted; the cumulative ACK=120 covers both segments; SendBase advances 92 → 100 → 120]

TCP Reliable Transport: TCP Retransmission Scenarios

[figure: cumulative ACK: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but the cumulative ACK=120 arrives before timeout, so Host A does not retransmit; it next sends Seq=120, 15 bytes of data]

TCP Reliable Transport: TCP Retransmission Scenarios

TCP Reliable Transport: TCP ACK generation [RFC 1122, 2581]

event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit

[figure: fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); Seq=100 is lost; ACK=100 arrives repeatedly; on the third duplicate ACK, Host A retransmits Seq=100, 20 bytes of data, before the timeout fires]

TCP Reliable Transport: TCP Fast Retransmit
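The triple-duplicate-ACK rule can be sketched as a pure function over a stream of cumulative ACK numbers. A toy detector (not a full sender; it only reports where a fast retransmit would fire):

```python
def acks_triggering_fast_retransmit(acks):
    """Return positions in the ACK stream where the 3-duplicate-ACK
    rule would trigger a fast retransmit."""
    fires, dup, last = [], 0, None
    for i, a in enumerate(acks):
        if a == last:
            dup += 1
            if dup == 3:          # third duplicate of the same ACK
                fires.append(i)
        else:
            last, dup = a, 0      # new cumulative ACK resets the count
    return fires
```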

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip time and timeouts

[figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr: RTT (milliseconds) vs. time (seconds), showing noisy SampleRTT points and the smoother EstimatedRTT]

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Round-trip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)

  TimeoutInterval = EstimatedRTT + 4·DevRTT
                    (estimated RTT)  ("safety margin")

TCP Reliable Transport: TCP Round-trip time and timeouts
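The EWMA formulas above translate directly into a small estimator. A sketch (α = 0.125, β = 0.25 as on the slides; the first sample simply initializes the state, and the exact initialization/ordering details vary by implementation):

```python
ALPHA, BETA = 0.125, 0.25   # typical values from the slides

class RttEstimator:
    """EstimatedRTT / DevRTT / TimeoutInterval per the formulas above."""

    def __init__(self):
        self.estimated = None
        self.dev = 0.0

    def sample(self, rtt):
        if self.estimated is None:
            self.estimated = rtt               # first measurement seeds the EWMA
        else:
            self.dev = (1 - BETA) * self.dev + BETA * abs(rtt - self.estimated)
            self.estimated = (1 - ALPHA) * self.estimated + ALPHA * rtt
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated + 4 * self.dev   # estimate plus safety margin
```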


TCP Reliable Transport

[figure: receiver protocol stack: application process above the OS; TCP socket receiver buffers sit between the TCP code and the application; IP code below; segments arrive from the sender]

application may remove data from TCP socket buffers …
… slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP Reliable Transport: Flow Control

[figure: receiver-side buffering: RcvBuffer holds buffered data plus rwnd bytes of free buffer space; TCP segment payloads flow in, the application process reads out]

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
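The sender-side rule above, combined with the congestion window from earlier, can be sketched as one function: in-flight bytes must stay within rwnd (flow control) and, when congestion control is active, within cwnd as well. Names here are illustrative, not from any real TCP stack:

```python
def sendable_bytes(last_byte_sent, last_byte_acked, rwnd, cwnd=None):
    """How many more bytes the sender may put in flight right now.
    In-flight data is capped at rwnd, and at min(rwnd, cwnd) if cwnd is given."""
    in_flight = last_byte_sent - last_byte_acked
    limit = rwnd if cwnd is None else min(rwnd, cwnd)
    return max(0, limit - in_flight)
```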

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 18: Transport Layer and Data Center TCP

source port dest port

32 bits

applicationdata (variable length)

sequence number

acknowledgement number

receive window

Urg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now(generally not used)

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP Reliable TransportTCP Segment Structure

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)

- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set window size
- full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
- flow controlled: sender will not overwhelm receiver

TCP Reliable Transport

Before exchanging data, sender and receiver "handshake":
- agree to establish connection (each knowing the other is willing to establish the connection)
- agree on connection parameters

Once established, each side holds connection state (ESTAB) and connection variables: seq #s (client-to-server, server-to-client), rcvBuffer size at server/client.

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

Connection Management: TCP 3-way handshake
TCP Reliable Transport

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

FSM view: the client moves closed → SYNsent (on Socket clientSocket = new Socket("hostname", "port number"): send SYN(seq=x)) → ESTAB (on SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)). The server moves closed → listen (on Socket connectionSocket = welcomeSocket.accept()) → SYNrcvd (on SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client) → ESTAB (on ACK(ACKnum=y+1)).

TCP Reliable Transport: Connection Management, TCP 3-way handshake
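The handshake above is exactly what the kernel performs when an application calls connect() and accept(): accept() returns only once the 3-way handshake has completed. A minimal loopback sketch in Python (the slides use Java sockets; the port choice and echo payload here are illustrative):

```python
import socket
import threading

def run_server(ready, port_box):
    welcome = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    welcome.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    welcome.listen(1)
    port_box.append(welcome.getsockname()[1])
    ready.set()
    conn, _ = welcome.accept()          # server: LISTEN -> SYN RCVD -> ESTAB
    data = conn.recv(1024)
    conn.sendall(data)                  # echo back, like the telnet 'C' example
    conn.close()
    welcome.close()

ready, port_box = threading.Event(), []
t = threading.Thread(target=run_server, args=(ready, port_box))
t.start()
ready.wait()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port_box[0]))  # client: send SYN, get SYNACK, send ACK
client.sendall(b"C")
echoed = client.recv(1024)
client.close()
t.join()
print(echoed)   # b'C'
```

The application never sees the SYN/SYNACK/ACK segments themselves; it only observes connect() and accept() returning once both sides reach ESTAB.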

TCP Reliable Transport: Connection Management, closing a connection

- client, server each close their side of connection: send TCP segment with FIN bit = 1
- respond to received FIN with ACK; on receiving a FIN, the ACK can be combined with own FIN
- simultaneous FIN exchanges can be handled

client state: ESTAB → FIN_WAIT_1 → FIN_WAIT_2 → TIMED_WAIT → CLOSED; server state: ESTAB → CLOSE_WAIT → LAST_ACK → CLOSED

1. client calls clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send, but can receive data)
2. server ACKs (ACKbit=1, ACKnum=x+1); enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2, waiting for server close
3. server sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client ACKs (ACKbit=1, ACKnum=y+1); enters TIMED_WAIT (timed wait for 2× max segment lifetime), then CLOSED; server enters CLOSED on receiving the ACK

TCP Reliable Transport: Connection Management, closing a connection

TCP Reliable Transport

TCP sender events:

- data rcvd from app: create segment with seq # (seq # is byte-stream number of first data byte in segment); start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
- timeout: retransmit segment that caused timeout; restart timer
- ACK rcvd: if ACK acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport

Lost ACK scenario (Host A, Host B): A sends Seq=92, 8 bytes of data; B's ACK=100 is lost (X); A times out and retransmits Seq=92, 8 bytes of data; B re-ACKs with ACK=100. (SendBase moves 92 → 100.)

Premature timeout: A sends Seq=92, 8 bytes of data and Seq=100, 20 bytes of data; B sends ACK=100, then ACK=120; A's timer expires before the ACKs arrive, so A retransmits Seq=92, 8 bytes of data; B replies with cumulative ACK=120. (SendBase moves 100 → 120.)

TCP Reliable Transport: TCP Retransmission Scenarios

Cumulative ACK scenario: A sends Seq=92, 8 bytes of data and Seq=100, 20 bytes of data; the first ACK=100 is lost (X), but the cumulative ACK=120 arrives before timeout, so A does not retransmit and next sends Seq=120, 15 bytes of data.

TCP Reliable Transport: TCP Retransmission Scenarios

TCP ACK generation [RFC 1122, RFC 2581]: event at receiver → TCP receiver action

- arrival of in-order segment with expected seq #, all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
- arrival of in-order segment with expected seq #, one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
- arrival of out-of-order segment with higher-than-expected seq # (gap detected) → immediately send duplicate ACK, indicating seq # of next expected byte
- arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

Reliable Transport

- time-out period often relatively long: long delay before resending lost packet
- detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for the same data ("triple duplicate ACKs"), resend the unacked segment with the smallest seq #. It is likely that the unacked segment was lost, so don't wait for the timeout.

TCP Reliable Transport: TCP Fast Retransmit

Fast retransmit scenario (Host A, Host B): A sends Seq=92, 8 bytes of data and Seq=100, 20 bytes of data; the latter is lost (X). B returns ACK=100 four times; on receipt of the triple duplicate ACK, A retransmits Seq=100, 20 bytes of data before its timeout expires.

TCP Reliable Transport: TCP Fast Retransmit
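The triple-duplicate-ACK rule above reduces to a small counter on the sender side. A sketch (the ACK sequences fed in are illustrative, matching the figure's scenario):

```python
def fast_retransmit_check(acks, threshold=3):
    """Return the seq numbers a sender would fast-retransmit.

    acks: cumulative ACK numbers in arrival order. A duplicate is an ACK
    equal to the previous one; on the 3rd duplicate the sender resends the
    segment starting at that seq number without waiting for a timeout.
    """
    retransmitted = []
    dup_count, last_ack = 0, None
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                # resend unacked segment with smallest seq #
                retransmitted.append(ack)
        else:
            last_ack, dup_count = ack, 0
    return retransmitted

# figure's scenario: Seq=100 segment lost, four ACK=100s arrive
print(fast_retransmit_check([100, 100, 100, 100]))  # [100]
# no duplicates -> no fast retransmit
print(fast_retransmit_check([100, 120]))            # []
```

A real implementation also enters fast recovery at this point (see the congestion-control FSM later in the deck); this sketch isolates only the detection rule.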

Q: how to set TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss

Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
- SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip Time and Timeouts

[Figure: SampleRTT and EstimatedRTT, gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100-350) vs. time (seconds, 1-106)]

EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

- exponential weighted moving average
- influence of past sample decreases exponentially fast
- typical value: α = 0.125

TCP Reliable Transport: TCP Round-trip Time and Timeouts

- timeout interval: EstimatedRTT plus "safety margin": large variation in EstimatedRTT → larger safety margin
- estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|   (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus a "safety margin")

TCP Reliable Transport: TCP Round-trip Time and Timeouts
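The two EWMA updates and the timeout rule fit in a few lines. One ambiguity the slides leave open is whether DevRTT uses the old or the updated EstimatedRTT (RFC 6298 updates the deviation first); this sketch uses the updated value, and the starting state and samples are illustrative:

```python
ALPHA, BETA = 0.125, 0.25   # typical values from the slides

def update_rtt(estimated, dev, sample):
    """One EWMA update of EstimatedRTT/DevRTT and the resulting timeout (ms)."""
    estimated = (1 - ALPHA) * estimated + ALPHA * sample
    dev = (1 - BETA) * dev + BETA * abs(sample - estimated)
    return estimated, dev, estimated + 4 * dev   # TimeoutInterval

est, dev = 100.0, 5.0        # ms; illustrative starting state
for sample in [120.0, 110.0, 90.0]:
    est, dev, timeout = update_rtt(est, dev, sample)
print(round(est, 2), round(timeout, 2))   # 101.76 136.72
```

Note how the timeout sits well above the smoothed RTT: the 4·DevRTT term is the "safety margin", and it grows when samples are noisy.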

TCP Reliable Transport

[Figure: receiver protocol stack: the application process reads from TCP socket receiver buffers; TCP code and IP code sit below, in the OS; segments arrive from the sender]

The application may remove data from TCP socket buffers slower than the TCP receiver is delivering (i.e., slower than the sender is sending).

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast.

TCP Reliable Transport: Flow Control

Receiver-side buffering: RcvBuffer holds buffered data; rwnd is the free buffer space; TCP segment payloads arrive into the buffer and are read out to the application process.

- receiver "advertises" free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits amount of unacked ("in-flight") data to receiver's rwnd value
- this guarantees the receive buffer will not overflow

TCP Reliable Transport: Flow Control
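The sender-side rule is just arithmetic: in-flight bytes (LastByteSent - LastByteAcked) must stay at or below the advertised rwnd. A sketch with illustrative byte offsets:

```python
def usable_window(rwnd, last_byte_sent, last_byte_acked):
    """Bytes the sender may still put in flight without overflowing the
    receiver: in-flight data (LastByteSent - LastByteAcked) must stay <= rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# receiver advertised rwnd = 4096 bytes; 3000 bytes already in flight
print(usable_window(4096, 13000, 10000))   # 1096
# window full: sender must wait for ACKs (or a fresh window advertisement)
print(usable_window(4096, 14096, 10000))   # 0
```

When rwnd reaches 0, real TCP senders keep probing with tiny segments so a later, larger rwnd advertisement is not lost.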

Goals for Today

- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem

Principles of Congestion Control

congestion:
- informally: "too many sources sending too much data too fast for network to handle"
- different from flow control!
- manifestations: lost packets (buffer overflow at routers); long delays (queueing in router buffers)

Two broad approaches towards congestion control:

end-end congestion control:
- no explicit feedback from network
- congestion inferred from end-system observed loss, delay
- approach taken by TCP

network-assisted congestion control:
- routers provide feedback to end systems
- single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
- explicit rate for sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K. [Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs:
- additive increase: increase cwnd by 1 MSS every RTT until loss detected
- multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) over time; AIMD saw-tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]

TCP Congestion Control: TCP Fairness, Why is TCP Fair? AIMD: additive increase, multiplicative decrease
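The saw-tooth can be sketched in a few lines; the starting cwnd and the loss pattern below are illustrative, not taken from the slides:

```python
MSS = 1          # count cwnd in MSS units for simplicity

def aimd(cwnd, loss):
    """One RTT of AIMD: +1 MSS per RTT, halve on loss (the saw-tooth)."""
    return max(1, cwnd // 2) if loss else cwnd + MSS

cwnd, trace = 8, []
for loss in [False, False, False, True, False, False]:
    cwnd = aimd(cwnd, loss)
    trace.append(cwnd)
print(trace)   # [9, 10, 11, 5, 6, 7]
```

The additive climb and multiplicative drop are exactly what makes two competing flows converge toward an equal share, as the fairness plot below argues.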

- sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
- cwnd is a dynamic function of perceived network congestion
- sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; cwnd spans the in-flight region

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:

rate ≈ cwnd / RTT   (bytes/sec)

TCP Congestion Control

Two competing sessions:
- additive increase gives slope of 1, as throughput increases
- multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, both up to R; the trajectory oscillates around the equal bandwidth share line: congestion avoidance (additive increase), then loss (decrease window by factor of 2), repeated]

TCP Congestion Control: TCP Fairness, Why is TCP Fair?

Fairness and UDP:
- multimedia apps often do not use TCP: do not want rate throttled by congestion control
- instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
- application can open multiple parallel connections between two hosts; web browsers do this
- e.g., link of rate R with 9 existing connections: new app asks for 1 TCP, gets rate R/10; new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness

when connection begins, increase rate exponentially until first loss event:
- initially cwnd = 1 MSS
- double cwnd every RTT
- done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: Host A sends one segment, waits one RTT, then two segments, then four segments to Host B]

TCP Congestion Control: Slow Start

TCP Congestion Control: Detecting and Reacting to Loss

- loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
- loss indicated by 3 duplicate ACKs (TCP RENO): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Switching from Slow Start to Congestion Avoidance (CA):
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.
Implementation: variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event

TCP Congestion Control FSM

slow start (initial state: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
- new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
- cwnd > ssthresh: enter congestion avoidance
- duplicate ACK: dupACKcount++
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
- new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
- duplicate ACK: dupACKcount++
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start

fast recovery:
- duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
- new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start

TCP Congestion Control
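The FSM above can be transcribed into a toy class. This is a sketch of the slide's state machine only, counting cwnd in MSS units; a real Reno implementation tracks bytes and several additional edge cases:

```python
MSS = 1

class RenoSketch:
    """Toy transcription of the slide's congestion-control FSM."""
    def __init__(self):
        self.state, self.cwnd, self.ssthresh, self.dup = "slow_start", 1, 64, 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd    # ~ +1 MSS per RTT
        else:                                     # fast_recovery
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
        self.dup = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # window inflation
            return
        self.dup += 1
        if self.dup == 3:                         # triple duplicate ACK
            self.ssthresh = self.cwnd // 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh, self.cwnd, self.dup = self.cwnd // 2, 1, 0
        self.state = "slow_start"

tcp = RenoSketch()
for _ in range(10):
    tcp.on_new_ack()          # slow start: cwnd 1 -> 11
for _ in range(3):
    tcp.on_dup_ack()          # triple dup ACK: ssthresh=5, cwnd=8
print(tcp.state, tcp.cwnd, tcp.ssthresh)   # fast_recovery 8 5
```

Driving the object with on_new_ack / on_dup_ack / on_timeout traces exactly the transitions listed above.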

TCP Throughput

- avg TCP throughput as a function of window size, RTT (ignore slow start, assume always data to send)
- W: window size (measured in bytes) where loss occurs; window oscillates between W/2 and W
  - avg window size (# in-flight bytes) is 3/4 W
  - avg throughput is 3/4 W per RTT:

avg TCP throughput = (3/4) · W / RTT   (bytes/sec)

TCP over "long, fat pipes"

- example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = (1.22 · MSS) / (RTT · √L)

- to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰: a very small loss rate!
- new versions of TCP needed for high-speed operation
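Plugging the slide's numbers into the Mathis formula confirms the claim: a loss rate of 2·10⁻¹⁰ yields roughly 10 Gbps with 1500-byte segments at 100 ms RTT.

```python
from math import sqrt

def mathis_throughput(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. 1997: throughput ~ 1.22 * MSS / (RTT * sqrt(L)), in bytes/sec."""
    return 1.22 * mss_bytes / (rtt_s * sqrt(loss_rate))

# the slide's example: 1500-byte segments, 100 ms RTT, loss L = 2e-10
thruput_bps = 8 * mathis_throughput(1500, 0.100, 2e-10)
print(round(thruput_bps / 1e9, 1), "Gbps")   # ~10 Gbps
```

Inverting the same formula shows why ordinary loss rates (say 10⁻⁶) cap a single Reno flow far below 10 Gbps on such a path, motivating the high-speed TCP variants.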

Goals for Today

- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
- test on an Ethernet-based storage cluster
- client performs synchronized reads
- increase # of servers involved in transfer (SRU size is fixed)
- TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a client connects through a switch to storage servers 1-4. A data block is striped across the servers in Server Request Units (SRUs); the client issues a synchronized read, receives SRUs 1, 2, 3, 4, and only then sends the next batch of requests]

Link idle time due to timeouts

[Figure: the same synchronized read of SRUs 1-4, but server 4's response is lost; the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast

- [Nagle04] called this Incast
- cause of throughput collapse: TCP timeouts
- [Figure: goodput collapses as the number of servers involved in the transfer grows]
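A back-of-the-envelope calculation shows why a single RTO devastates goodput on a fast network. The block size, transfer time, and 200 ms RTO below are illustrative assumptions, not measurements from the FAST'08 paper:

```python
def goodput_mbps(block_bytes, transfer_us, rto_us, timeout_happened):
    """Effective goodput when a synchronized read stalls on an RTO.

    A block that moves in `transfer_us` microseconds but then waits out
    a retransmission timeout leaves the link idle almost the whole time.
    """
    total_us = transfer_us + (rto_us if timeout_happened else 0)
    return block_bytes * 8 / total_us   # bits per microsecond == Mbps

fast = goodput_mbps(256 * 1024, 2000, 200_000, False)    # no loss
stalled = goodput_mbps(256 * 1024, 2000, 200_000, True)  # one RTO
print(round(fast), round(stalled), "Mbps")   # 1049 10 Mbps
```

A 2 ms transfer followed by a 200 ms stall is roughly a 100x goodput loss, which is the shape of the incast collapse curve.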

TCP data-driven loss recovery

[Figure: sender transmits Seq 1-5; packet 2 is lost. The receiver ACKs 1 for each later arrival; 3 duplicate ACKs for 1 tell the sender packet 2 is probably lost, so it retransmits packet 2 immediately, and the receiver then sends Ack 5.] In SANs, this recovery happens in microseconds after a loss.

TCP timeout-driven loss recovery

[Figure: sender transmits Seq 1-5, receives only Ack 1, and must wait out the Retransmission Timeout (RTO) before resending.] Timeouts are expensive (msecs to recover after a loss).

TCP loss recovery comparison

[Figure: side-by-side timelines. Data-driven recovery (retransmit 2 on triple duplicate ACK, then Ack 5) is super fast (µs) in SANs; timeout-driven recovery (wait for the RTO after Ack 1) is slow (ms).]

TCP Throughput Collapse: Summary

- Synchronized reads and TCP timeouts cause TCP throughput collapse
- Previously tried options: increase buffer size (costly); reduce RTOmin (unsafe); use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limits in-network buffer occupancy (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective

principles behind transport layer services:
- multiplexing, demultiplexing
- reliable data transfer
- flow control
- congestion control

instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer, leaving the network "edge" (application, transport layers) and going into the network "core".

Before Next time

- Project Proposal: due in one week; meet with groups, TA, and professor
- Lab 1: single-threaded TCP proxy; due in one week, next Friday
- No required reading and review due; but review chapter 4 from the book (Network Layer): we will also briefly discuss data center topologies
- Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 19: Transport Layer and Data Center TCP

sequence numbers

ndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgements

ndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segments

ndashA TCP spec doesnrsquot say - up to implementor

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window size N

sender sequence number space

source port dest port

sequence number

acknowledgement number

checksum

rwnd

urg pointer

outgoing segment from sender

TCP Reliable TransportTCP Sequence numbers and Acks

Usertypes

lsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP Reliable TransportTCP Sequence numbers and Acks

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing

to establish connection)bull agree on connection parameters

connection state ESTABconnection variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

connection state ESTABconnection Variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

Socket clientSocket = newSocket(hostnameport

number)

Socket connectionSocket = welcomeSocketaccept()

Connection Management TCP 3-way handshake

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control – Detecting and Reacting to Loss
• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate network capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control – Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control (FSM)
slow start – entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
congestion avoidance
• new ACK: cwnd = cwnd + MSS × (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
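The FSM above can be rendered as a small class. This is a sketch in MSS units, not a byte-accurate Reno: the `>=` threshold test and the per-ACK granularity are my simplifications.

```python
# Sketch of the slide's FSM (Reno-style congestion control). cwnd and
# ssthresh are in MSS units for readability; real TCP works in bytes.
class RenoCC:
    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1.0          # cwnd = 1 MSS
        self.ssthresh = 64.0
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += 1.0                 # +1 MSS per ACK (doubles per RTT)
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += 1.0 / self.cwnd     # ~ +1 MSS per RTT
        else:                                # a new ACK ends fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0                 # inflate window per dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:               # triple dup ACK: fast retransmit
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):                    # timeout from any state
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"
```

For example, ten new ACKs in slow start take cwnd from 1 to 11 MSS; a triple duplicate ACK then sets ssthresh = 5.5, inflates cwnd to 8.5, and enters fast recovery.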

TCP Throughput
• avg. TCP throughput as function of window size, RTT (ignore slow start, assume there is always data to send)
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT: avg TCP throughput = (3/4) W / RTT bytes/sec
(figure: window oscillates between W/2 and W)

TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = (1.22 × MSS) / (RTT × √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP are needed for high-speed networks
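Both numbers on this slide can be reproduced with a few lines (a sketch: the window needed to keep the pipe full follows from throughput ≈ W/RTT, and the loss rate comes from inverting the Mathis et al. formula):

```python
# Reproducing the slide's numbers (MSS = 1500 bytes, RTT = 100 ms,
# target throughput = 10 Gbps).
MSS_BITS = 1500 * 8
RTT = 0.100          # seconds
TARGET = 10e9        # bits/sec

# Window needed to keep the pipe full: throughput ~ W/RTT, so W = rate * RTT.
W_segments = TARGET * RTT / MSS_BITS
print(round(W_segments))       # 83333 in-flight segments

# Mathis et al. 1997: throughput = 1.22 * MSS / (RTT * sqrt(L)).
# Solving for the loss probability L at the target throughput:
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2
print(L)                       # ~2.1e-10, a very small loss rate
```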

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems
(figure: a client connects through a switch to storage servers 1–4; a data block is striped across the servers in Server Request Units (SRUs); in a synchronized read the client requests SRUs 1, 2, 3, 4 and, once the whole data block arrives, sends the next batch of requests)

Link idle time due to timeouts
(figure: same synchronized read of SRUs 1–4, but server 4's data is lost; the client's link is idle until server 4 experiences a timeout and retransmits)

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
(figure: goodput collapses as the number of servers increases)

TCP data-driven loss recovery
(figure: sender transmits segments 1–5; segment 2 is lost; each later segment elicits another Ack 1, and 3 duplicate ACKs for 1 mean packet 2 is probably lost, so the sender retransmits packet 2 immediately and the receiver answers with Ack 5)
In SANs: recovery in µsecs after loss.

TCP timeout-driven loss recovery
(figure: sender transmits segments 1–5, but only segment 1's Ack arrives; with no duplicate ACKs, the sender must wait out the Retransmission Timeout (RTO) before resending)
• Timeouts are expensive (msecs to recover after loss)

TCP loss recovery comparison
(figure, side by side: data-driven recovery retransmits segment 2 as soon as the triple duplicate Ack 1 arrives, then receives Ack 5; timeout-driven recovery sits idle until the Retransmission Timeout (RTO) fires)
Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs.

TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
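A back-of-the-envelope calculation shows why even one RTO stall collapses goodput. The numbers here are illustrative assumptions (1 Gbps link, 4 servers × 256 KB SRU, 200 ms RTO_min), not measurements from the paper:

```python
# Goodput for one synchronized-read batch: useful bits delivered divided by
# the time the client's link is occupied, including any stall spent waiting
# for a straggler's retransmission timeout.
def goodput_with_stall(block_bytes, link_bps, stall_s):
    transfer_s = block_bytes * 8 / link_bps
    return block_bytes * 8 / (transfer_s + stall_s)   # bits/sec

LINK = 1e9                      # 1 Gbps SAN link (assumed)
BLOCK = 4 * 256 * 1024          # 4 servers x 256 KB SRU (assumed)

no_stall = goodput_with_stall(BLOCK, LINK, 0.0)
one_rto = goodput_with_stall(BLOCK, LINK, 0.2)   # one 200 ms RTO_min stall
print(no_stall / 1e6, one_rto / 1e6)  # roughly 1000 Mbps vs ~40 Mbps
```

Because the block transfers in a few milliseconds while RTO_min is 200 ms, a single timeout leaves the link idle for ~96% of the batch, which is the collapse the slides describe.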

Perspective
• principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
• instantiation, implementation in the Internet: UDP, TCP
Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer Services/Protocols
  • Transport Layer Services/Protocols (2)
  • Transport Layer Services/Protocols (3)
  • Transport Layer Services/Protocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over "long, fat pipes"
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
TCP Reliable Transport: TCP Sequence numbers and ACKs
(simple telnet scenario: user types 'C'; Host A sends Seq=42, ACK=79, data='C'; Host B ACKs receipt of 'C' and echoes back 'C' with Seq=79, ACK=43, data='C'; Host A ACKs receipt of the echoed 'C' with Seq=43, ACK=80)
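The ACK arithmetic in this scenario (an ACK carries the seq # of the next byte expected from the other side) can be checked directly:

```python
# ACK number = sequence number of the next byte expected from the peer.
def next_ack(seq, data_len):
    return seq + data_len

# Host B acks the client's 1-byte 'C'; Host A acks the 1-byte echo.
print(next_ack(42, 1), next_ack(79, 1))  # 43 80
```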

TCP Reliable Transport
TCP: Transmission Control Protocol – RFCs 793, 1122, 1323, 2018, 2581
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport: Connection Management – TCP 3-way handshake
before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters
client: Socket clientSocket = new Socket("hostname","port number")
server: Socket connectionSocket = welcomeSocket.accept()
(both ends reach connection state ESTAB with connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client)

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake
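The message/state walk above can be sketched in a few lines (x = 100 and y = 300 are arbitrary initial sequence numbers chosen for illustration):

```python
# State walk of the 3-way handshake from the slide.
def three_way_handshake(x=100, y=300):
    client, server = "SYNSENT", "LISTEN"     # client has sent its SYN
    syn = {"SYN": 1, "seq": x}
    server = "SYN_RCVD"                      # server received the SYN
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    client = "ESTAB"                         # SYNACK(x) says server is live
    ack = {"ACK": 1, "seq": x + 1, "acknum": synack["seq"] + 1}
    server = "ESTAB"                         # ACK(y) says client is live
    return client, server, ack["acknum"]

print(three_way_handshake())  # ('ESTAB', 'ESTAB', 301)
```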

TCP Reliable Transport: Connection Management – TCP 3-way handshake (state diagram)
client: closed → [Socket clientSocket = new Socket("hostname","port number"): send SYN(seq=x)] → SYNsent → [receive SYNACK(seq=y, ACKnum=x+1): send ACK(ACKnum=y+1)] → ESTAB
server: closed → [Socket connectionSocket = welcomeSocket.accept()] → listen → [receive SYN(x): send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client] → SYNrcvd → [receive ACK(ACKnum=y+1)] → ESTAB

TCP Reliable Transport: Connection Management – Closing connection
• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management – Closing connection
(client state: ESTAB → FIN_WAIT_1 → FIN_WAIT_2 → TIMED_WAIT → CLOSED; server state: ESTAB → CLOSE_WAIT → LAST_ACK → CLOSED)
1. clientSocket.close(): client sends FINbit=1, seq=x; can no longer send, but can receive data (FIN_WAIT_1)
2. server ACKs: ACKbit=1, ACKnum=x+1; server can still send data (CLOSE_WAIT); client waits for server close (FIN_WAIT_2)
3. server sends FINbit=1, seq=y; can no longer send data (LAST_ACK)
4. client ACKs: ACKbit=1, ACKnum=y+1; client does a timed wait for 2 × max segment lifetime (TIMED_WAIT) before CLOSED; server goes to CLOSED on receiving the ACK

TCP Reliable Transport: TCP sender events
data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
timeout:
• retransmit segment that caused timeout; restart timer
ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport: TCP Retransmission Scenarios
(lost ACK scenario: Host A sends Seq=92, 8 bytes of data; the ACK=100 is lost; A times out, retransmits Seq=92, 8 bytes, and B re-sends ACK=100; SendBase advances from 92 to 100)
(premature timeout: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the ACKs ACK=100, ACK=120 arrive late; A's timer expires and it resends Seq=92; B replies with ACK=120, and SendBase advances to 120)

TCP Reliable Transport: TCP Retransmission Scenarios (cumulative ACK)
(Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost, but the cumulative ACK=120 arrives before the timeout and covers both segments, so A continues with Seq=120, 15 bytes and nothing is retransmitted)

TCP ACK generation [RFC 1122, RFC 2581] – Reliable Transport
event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
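The table reads naturally as a decision function. This is a sketch: the boolean event encoding (`in_order`, `ack_pending`, `fills_gap`) is mine, not from the RFCs.

```python
# The receiver-side ACK policy of the table as a decision function.
def receiver_action(in_order, ack_pending, fills_gap):
    if in_order and not ack_pending:
        return "delayed ACK: wait up to 500 ms for next segment, then ACK"
    if in_order and ack_pending:
        return "immediately send single cumulative ACK for both segments"
    if not in_order and not fills_gap:
        return "immediately send duplicate ACK for next expected byte"
    return "immediately send ACK if segment starts at lower end of gap"

print(receiver_action(in_order=False, ack_pending=False, fills_gap=False))
```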

TCP Reliable Transport: TCP Fast Retransmit
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs:
  – sender often sends many segments back-to-back
  – if a segment is lost, there will likely be many duplicate ACKs
TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
• likely that unacked segment was lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit
(figure: fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; the second segment is lost; B sends ACK=100 repeatedly, and on the third duplicate A resends Seq=100, 20 bytes before its timeout expires)
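The sender-side dup-ACK counting can be sketched as follows (a simplification: real TCP tracks bytes and resets state on each new cumulative ACK):

```python
# Three ACKs re-acknowledging the same byte (the original plus 3 duplicates
# in the ACK stream) trigger an immediate retransmit of the missing segment.
def process_acks(acks):
    last_ack, dup_count = None, 0
    retransmits = []
    for a in acks:
        if a == last_ack:
            dup_count += 1
            if dup_count == 3:              # triple duplicate ACK
                retransmits.append(a)       # resend segment starting at a
        else:
            last_ack, dup_count = a, 0
    return retransmits

# Scenario from the slide: segment at 100 is lost, so ACK=100 repeats.
print(process_acks([100, 100, 100, 100, 120]))  # [100]
```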

TCP Reliable Transport: TCP Round-trip time and timeouts
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip time and timeouts
EstimatedRTT = (1 – α) × EstimatedRTT + α × SampleRTT
• exponential weighted moving average: influence of past sample decreases exponentially fast
• typical value: α = 0.125
(figure: SampleRTT and EstimatedRTT, RTT in milliseconds vs. time in seconds, for gaia.cs.umass.edu to fantasia.eurecom.fr; EstimatedRTT smooths the fluctuating SampleRTT)

TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin"; large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 – β) × DevRTT + β × |SampleRTT – EstimatedRTT|   (typically, β = 0.25)
• TimeoutInterval = EstimatedRTT + 4 × DevRTT   (estimated RTT + "safety margin")
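Both update rules can be run on a few samples (a sketch: α = 0.125 and β = 0.25 are the textbook defaults; initializing DevRTT to SampleRTT/2 is my assumption, in the style of RFC 6298):

```python
# EWMA RTT estimation and the timeout interval from the slides.
def rtt_estimator(samples, alpha=0.125, beta=0.25):
    est, dev = samples[0], samples[0] / 2.0
    for s in samples[1:]:
        dev = (1 - beta) * dev + beta * abs(s - est)   # DevRTT update
        est = (1 - alpha) * est + alpha * s            # EstimatedRTT update
    return est, dev, est + 4 * dev                     # TimeoutInterval

# A 250 ms outlier widens the 4*DevRTT safety margin well past the estimate:
est, dev, timeout = rtt_estimator([100.0, 120.0, 110.0, 250.0, 105.0])
print(round(est, 1), round(timeout, 1))
```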

TCP Reliable Transport: Flow Control
(receiver protocol stack: from sender → IP code → TCP code → TCP socket receiver buffers → application process)
• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP Reliable Transport: Flow Control (receiver-side buffering)
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
(figure: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive, data drains to the application process)
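The sender's side of this rule is a one-line budget check (a sketch using the slide's variable names; byte values below are illustrative):

```python
# Sender-side flow control: unACKed ("in-flight") data may not exceed the
# receiver's advertised window rwnd.
def sendable(last_byte_sent, last_byte_acked, rwnd):
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)          # bytes still allowed out

# rwnd = 4096 (the typical default RcvBuffer size from the slide);
# 3000 bytes already in flight leaves 1096 bytes sendable:
print(sendable(last_byte_sent=8000, last_byte_acked=5000, rwnd=4096))  # 1096
```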

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Congestion Control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control
two broad approaches towards congestion control:
end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP
network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
(figure: TCP connections 1 and 2 traverse a bottleneck router of capacity R)

TCP Congestion Control – TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease
approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
(figure: AIMD "saw tooth" behavior, probing for bandwidth – TCP sender congestion window size vs. time: additively increase window size … until loss occurs, then cut window in half)

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 21: Transport Layer and Data Center TCP

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing

to establish connection)bull agree on connection parameters

connection state ESTABconnection variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

connection state ESTABconnection Variables

seq client-to-server server-to-clientrcvBuffer size at serverclient

application

network

Socket clientSocket = newSocket(hostnameport

number)

Socket connectionSocket = welcomeSocketaccept()

Connection Management TCP 3-way handshake

TCP Reliable Transport

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data

received ACK(y) indicates client is live

SYNSENT

ESTAB

SYN RCVD

client state

LISTEN

server state

LISTEN

TCP Reliable TransportConnection Management TCP 3-way handshake

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
• likely that unacked segment lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit

(timeline diagram: fast retransmit after sender receipt of triple duplicate ACK)
Host A sends Seq=92 (8 bytes of data) and Seq=100 (20 bytes of data); Seq=100 is lost (X). Host B returns ACK=100 four times; after the third duplicate ACK, A retransmits Seq=100, well before its timeout would expire.

TCP Reliable Transport: TCP Fast Retransmit
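The triple-duplicate-ACK trigger above can be traced with a few lines of code. A sketch under stated assumptions: the input is just the stream of ACK numbers a sender observes, and "3 duplicate ACKs" means four identical ACKs in a row.

```python
# Fast-retransmit trigger sketch (illustrative; function name is mine).

def fast_retransmit_trace(acks):
    """Given the ACK numbers a sender sees, return the seq #s it would
    fast-retransmit (after the 3rd duplicate of an ACK)."""
    retransmitted = []
    last_ack, dup_count = None, 0
    for a in acks:
        if a == last_ack:
            dup_count += 1
            if dup_count == 3:              # triple duplicate ACK
                retransmitted.append(a)     # resend segment starting at a
        else:
            last_ack, dup_count = a, 0
    return retransmitted

# The Host A/B timeline above: ACK=100 then three duplicates of it
# triggers an immediate resend of Seq=100.
print(fast_retransmit_trace([100, 100, 100, 100]))   # [100]
```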

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip time and timeouts

(chart: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr; RTT in milliseconds, roughly 100–350 ms, vs. time in seconds; the SampleRTT points fluctuate sharply while the EstimatedRTT curve tracks them smoothly)

EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP Reliable Transport: TCP Round-trip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
                  (estimated RTT)  ("safety margin")

TCP Reliable Transport: TCP Round-trip time and timeouts
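The EstimatedRTT / DevRTT / TimeoutInterval rules above translate directly into code. A minimal sketch using the slide's typical values α = 0.125 and β = 0.25; the first sample simply initializes the estimate.

```python
# RTT estimator implementing the update rules on this slide.

def make_rtt_estimator(alpha=0.125, beta=0.25):
    est, dev = None, 0.0
    def update(sample_rtt):
        nonlocal est, dev
        if est is None:
            est = sample_rtt                  # first measurement seeds EstimatedRTT
        else:
            # DevRTT uses the previous EstimatedRTT, then EstimatedRTT is updated
            dev = (1 - beta) * dev + beta * abs(sample_rtt - est)
            est = (1 - alpha) * est + alpha * sample_rtt
        return est + 4 * dev                  # TimeoutInterval
    return update

update = make_rtt_estimator()
update(0.100)             # first sample: EstimatedRTT = 100 ms
timeout = update(0.120)   # EstimatedRTT = 102.5 ms, DevRTT = 5 ms
print(timeout)            # 0.1225 (i.e., 122.5 ms)
```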

• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size

TCP: Transmission Control Protocol. RFCs: 793, 1122, 1323, 2018, 2581

TCP Reliable Transport

(diagram: receiver protocol stack; the application process sits above the OS, which holds the TCP socket receiver buffers, TCP code, and IP code; data arrives from the sender)

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP Reliable Transport: Flow Control

(receiver-side buffering diagram: RcvBuffer contains buffered data plus free buffer space, rwnd; TCP segment payloads arrive from the sender and drain up to the application process)

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

TCP Reliable Transport: Flow Control
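The sender-side rule above (keep in-flight data at or below the advertised rwnd) is a one-line check. A sketch with illustrative names; real stacks combine this with the congestion window as well.

```python
# Flow-control budget sketch: how many more bytes the sender may put
# in flight without exceeding the receiver-advertised window rwnd.

def bytes_sendable(last_byte_sent, last_byte_acked, rwnd):
    in_flight = last_byte_sent - last_byte_acked   # unacked ("in-flight") data
    return max(0, rwnd - in_flight)

# 3000 bytes in flight against the typical 4096-byte default RcvBuffer:
print(bytes_sendable(5000, 2000, 4096))   # 1096
print(bytes_sendable(5000, 2000, 2048))   # 0: receiver buffer nearly full
```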

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

(diagram: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R)

TCP Congestion Control: TCP Fairness

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

(figure: AIMD saw tooth behavior, probing for bandwidth; TCP sender congestion window size (cwnd) vs. time; additively increase window size … until loss occurs (then cut window in half))

TCP Congestion Control: TCP Fairness. Why is TCP Fair? AIMD: additive increase, multiplicative decrease

• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

(diagram: sender sequence number space; last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is bounded by cwnd)

TCP Congestion Control

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

(figure: Connection 1 throughput vs. Connection 2 throughput, both axes from 0 to R, with the "equal bandwidth share" diagonal; alternating congestion avoidance (additive increase) and loss (decrease window by factor of 2) moves the pair of rates toward the equal-share point)

TCP Congestion Control: TCP Fairness. Why is TCP Fair?
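The convergence argument above can be checked numerically: additive increase keeps the gap between two flows constant, while each multiplicative decrease halves it, so the gap shrinks toward zero. A toy fluid model, not a packet simulation; R and the starting rates below are arbitrary assumptions.

```python
# Two competing AIMD flows converging to the equal bandwidth share.

def aimd_two_flows(x1, x2, R, steps):
    for _ in range(steps):
        if x1 + x2 <= R:           # no loss: additive increase (slope-1 line)
            x1 += 1
            x2 += 1
        else:                      # loss: multiplicative decrease, halve both
            x1 /= 2
            x2 /= 2
    return x1, x2

x1, x2 = aimd_two_flows(80.0, 10.0, R=100.0, steps=1000)
# the initial 70-unit gap is halved on every loss event, so x1 ≈ x2 ≈ R/2
```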

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

(timeline: Host A sends one segment to Host B, then two segments, then four segments, one RTT apart)

TCP Congestion Control: Slow Start

TCP Congestion Control

• loss indicated by timeout:
  – cwnd set to 1 MSS; window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half; window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Detecting and Reacting to Loss

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

(FSM diagram; states: slow start, congestion avoidance, fast recovery)

slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

TCP Congestion Control
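The per-RTT cwnd evolution implied by the FSM above can be traced in a few lines: slow start doubles cwnd up to ssthresh, congestion avoidance adds 1 MSS per RTT, and a triple-duplicate-ACK loss halves it (Reno). A sketch under stated simplifications: timeouts and fast recovery's window inflation are omitted, units are whole MSS, and the loss schedule is given by hand.

```python
# Per-RTT Reno-style cwnd trace (simplified sketch; no timeouts,
# no fast-recovery inflation).

def reno_cwnd_trace(rtts, loss_rtts, ssthresh=8):
    cwnd, trace = 1, []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_rtts:                 # 3 duplicate ACKs this RTT
            ssthresh = max(cwnd // 2, 1)
            cwnd = ssthresh                # multiplicative decrease
        elif cwnd < ssthresh:              # slow start
            cwnd *= 2
        else:                              # congestion avoidance
            cwnd += 1
    return trace

print(reno_cwnd_trace(10, {6}))   # [1, 2, 4, 8, 9, 10, 11, 5, 6, 7]
```

The trace shows the exponential ramp (1, 2, 4, 8), the switch to linear growth at ssthresh, and the halving at the loss event.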

• avg. TCP thruput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. thruput is 3/4 W per RTT

  avg TCP thruput = (3/4) · W / RTT  bytes/sec

(figure: sawtooth oscillating between W/2 and W)

TCP Throughput

TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = (1.22 · MSS) / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ (a very small loss rate!)
• new versions of TCP for high-speed
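Plugging the slide's numbers into the two formulas above reproduces both figures: the required in-flight window and the Mathis et al. loss-rate bound.

```python
# Checking the "long, fat pipe" example: 1500-byte segments, 100 ms RTT,
# 10 Gbps target.

MSS = 1500 * 8        # bits per segment
RTT = 0.100           # seconds
target = 10e9         # 10 Gbps, in bits/sec

# window needed to fill the pipe: target bits per RTT, in segments
W = target * RTT / MSS

# loss rate that sustains the target, solving throughput = 1.22*MSS/(RTT*sqrt(L))
L = (1.22 * MSS / (RTT * target)) ** 2

print(round(W))       # 83333 in-flight segments
print(L)              # ~2.1e-10, i.e. roughly 2 * 10^-10
```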

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

(diagram: a Client connects through a Switch to Storage Servers 1–4; a Data Block is striped across the servers, each storing one Server Request Unit (SRU); in a Synchronized Read, the client requests SRUs 1, 2, 3, 4 in parallel, and only after all four responses (R) arrive does the client send the next batch of requests)

Link idle time due to timeouts

(diagram: same Client / Switch / Storage Servers setup; during a Synchronized Read of SRUs 1–4, server 4's response is dropped, so the link is idle until server 4 experiences a timeout, even though servers 1–3 have finished)

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

(graph: goodput collapses as the number of servers grows)

TCP data-driven loss recovery

(timeline diagram: sender transmits Seq 1, 2, 3, 4, 5; packet 2 is lost; the receiver answers packets 1, 3, 4, 5 with Ack 1, Ack 1, Ack 1, Ack 1; 3 duplicate ACKs for 1 mean packet 2 is probably lost, so the sender retransmits packet 2 immediately, and the receiver then sends Ack 5)

In SANs: recovery in µsecs after loss.

TCP timeout-driven loss recovery

(timeline diagram: sender transmits Seq 1, 2, 3, 4, 5; only packet 1 gets through and the receiver sends Ack 1; with no duplicate ACKs, the sender idles for the full Retransmission Timeout (RTO) before resending)

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

(side-by-side timelines: data-driven recovery, where three duplicate Ack 1s trigger "Retransmit 2" followed by Ack 5; vs. timeout-driven recovery, where the sender waits out the full Retransmission Timeout (RTO) before resending)

Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs.

TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
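DCTCP's end-host side of that design can be sketched from its published algorithm (Alizadeh et al., SIGCOMM 2010): the switch ECN-marks packets past a queue threshold, and the sender estimates the fraction of marked ACKs per window and shrinks cwnd in proportion, instead of always halving as Reno does. A toy sketch; the parameter g = 1/16 and the marking pattern below are illustrative assumptions.

```python
# DCTCP sender reaction sketch (per Alizadeh et al., SIGCOMM 2010).

def dctcp_step(cwnd, alpha, marked, total, g=1/16):
    F = marked / total                   # fraction of ECN-marked ACKs this window
    alpha = (1 - g) * alpha + g * F      # EWMA estimate of congestion extent
    if marked:
        cwnd = cwnd * (1 - alpha / 2)    # gentle, proportional decrease
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
for _ in range(50):                      # persistent light marking: 1 ACK in 10
    cwnd, alpha = dctcp_step(cwnd, alpha, marked=1, total=10)
# alpha settles near 0.1, so each window shrinks by ~5%, not 50% as in Reno;
# queues stay short without collapsing throughput
```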

• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Perspective

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over "long, fat pipes"
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters

client (application / network):
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
  Socket clientSocket = new Socket("hostname", "port number");

server (application / network):
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client
  Socket connectionSocket = welcomeSocket.accept();

Connection Management: TCP 3-way handshake

TCP Reliable Transport

(message sequence diagram)
client state: LISTEN -> SYNSENT -> ESTAB; server state: LISTEN -> SYN RCVD -> ESTAB

1. client chooses init seq num x, sends TCP SYN msg: SYNbit=1, Seq=x
2. server chooses init seq num y, sends TCP SYNACK msg acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. received SYNACK(x) indicates server is live; client sends ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data
4. received ACK(y) indicates client is live

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

(client/server state machines)
client: closed -> [Socket clientSocket = new Socket("hostname","port number"); send SYN(seq=x)] -> SYNsent -> [receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)] -> ESTAB
server: closed -> [Socket connectionSocket = welcomeSocket.accept()] -> listen -> [receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client] -> SYN rcvd -> [receive ACK(ACKnum=y+1)] -> ESTAB

TCP Reliable Transport: Connection Management (TCP 3-way handshake)

• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management (closing connection)

(message sequence diagram; client and server both start in ESTAB)
client: clientSocket.close(): enters FIN_WAIT_1, sends FINbit=1, seq=x (can no longer send, but can receive data); on ACKbit=1, ACKnum=x+1: FIN_WAIT_2 (wait for server close); on FINbit=1, seq=y: sends ACKbit=1, ACKnum=y+1, enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
server: on FIN: sends ACKbit=1, ACKnum=x+1, enters CLOSE_WAIT (can still send data); later sends FINbit=1, seq=y, enters LAST_ACK (can no longer send data); on ACK: CLOSED

TCP Reliable Transport: Connection Management (closing connection)
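The connect()/accept() pair on the slides is what triggers the 3-way handshake, and close() is what sends the FINs; the kernel does the SYN / SYNACK / ACK exchange itself. A minimal localhost sketch using Python's socket API (the slides' snippets are Java); port 0 asks the OS to pick a free port, and real code would add error handling.

```python
# Connection establishment and teardown over loopback: the handshake and
# FIN exchange happen inside the kernel on connect()/accept()/close().

import socket
import threading

welcome = socket.socket()                 # the slides' "welcomeSocket"
welcome.bind(("127.0.0.1", 0))            # port 0: let the OS pick one
welcome.listen(1)
port = welcome.getsockname()[1]

def server():
    conn, addr = welcome.accept()         # completes handshake: SYN RCVD -> ESTAB
    conn.sendall(b"hello")
    conn.close()                          # sends FIN (server side close)

t = threading.Thread(target=server)
t.start()

client = socket.socket()                  # the slides' "clientSocket"
client.connect(("127.0.0.1", port))       # kernel sends SYN, waits for SYNACK
data = client.recv(5)
client.close()                            # client-side FIN
t.join()
welcome.close()
print(data)                               # b'hello'
```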

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 23: Transport Layer and Data Center TCP

TCP Reliable Transport: Connection Management: TCP 3-way handshake

[sequence diagram:]
• client (LISTEN, then SYNSENT): chooses init seq num x, sends TCP SYN msg (SYNbit=1, Seq=x)
• server (LISTEN, then SYN RCVD): chooses init seq num y, sends TCP SYNACK msg acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1)
• client (ESTAB): received SYNACK(x) indicates server is live; sends ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data
• server (ESTAB): received ACK(y) indicates client is live

TCP Reliable Transport: Connection Management: TCP 3-way handshake

[state diagram: client closed -> SYNsent -> ESTAB; server closed -> listen -> SYNrcvd -> ESTAB]
• client: Socket clientSocket = new Socket(hostname, portnumber); sends SYN(seq=x)
• server: Socket connectionSocket = welcomeSocket.accept(); on SYN(x), replies SYNACK(seq=y, ACKnum=x+1) and creates new socket for communication back to client
• client: on SYNACK(seq=y, ACKnum=x+1), sends ACK(ACKnum=y+1)

TCP Reliable Transport: Connection Management: Closing connection

• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP Reliable Transport: Connection Management: Closing connection

[sequence diagram: client state, server state; both start in ESTAB]
• client (ESTAB -> FIN_WAIT_1): clientSocket.close(); sends FINbit=1, seq=x; can no longer send, but can receive data
• server (ESTAB -> CLOSE_WAIT): sends ACKbit=1, ACKnum=x+1; can still send data; client waits for server close in FIN_WAIT_2
• server (CLOSE_WAIT -> LAST_ACK): sends FINbit=1, seq=y; can no longer send data
• client (FIN_WAIT_2 -> TIMED_WAIT): sends ACKbit=1, ACKnum=y+1; timed wait for 2 × max segment lifetime, then CLOSED
• server (LAST_ACK -> CLOSED) on receiving the final ACK

TCP Reliable Transport

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data:
– bi-directional data flow in same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver

TCP Reliable Transport

data rcvd from app:
• create segment with seq #; seq # is byte-stream number of first data byte in segment
• start timer if not already running; think of timer as for oldest unacked segment; expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments: update what is known to be ACKed; start timer if there are still unacked segments

TCP Reliable Transport: TCP Retransmission Scenarios

[two sequence diagrams between Host A and Host B:
(a) lost ACK scenario: Host A sends Seq=92, 8 bytes of data; ACK=100 is lost; after timeout, Host A retransmits Seq=92, 8 bytes of data; SendBase=100 after the next ACK=100.
(b) premature timeout: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; timeout fires before ACK=100 arrives, so Seq=92, 8 bytes is retransmitted; cumulative ACK=120 advances SendBase from 92 to 120.]

TCP Reliable Transport: TCP Retransmission Scenarios

[sequence diagram: cumulative ACK scenario: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost, but cumulative ACK=120 arrives before timeout, covering both segments; Host A proceeds with Seq=120, 15 bytes of data with no retransmission.]

TCP ACK generation [RFC 1122, 2581]: Reliable Transport

event at receiver -> TCP receiver action:
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap

TCP Reliable Transport: TCP Fast Retransmit

• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs:
– sender often sends many segments back-to-back
– if segment is lost, there will likely be many duplicate ACKs
• TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout
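The fast-retransmit rule above can be sketched in a few lines. This is an illustrative sketch, not the slides' code: a hypothetical sender counts duplicate ACKs and resends the oldest unacked segment on the third duplicate (names such as `on_ack` are assumptions).

```python
# Sketch of TCP fast retransmit: on 3 duplicate ACKs for the same
# cumulative ACK number, retransmit the oldest unacked segment
# immediately instead of waiting for the retransmission timer.

def make_sender():
    state = {"last_ack": None, "dup_acks": 0, "retransmits": []}

    def on_ack(ack_seq):
        if ack_seq == state["last_ack"]:
            state["dup_acks"] += 1
            if state["dup_acks"] == 3:                # "triple duplicate ACK"
                state["retransmits"].append(ack_seq)  # resend segment starting at ack_seq
        else:                                         # new cumulative ACK advances window
            state["last_ack"] = ack_seq
            state["dup_acks"] = 0

    return state, on_ack

state, on_ack = make_sender()
for a in [100, 100, 100, 100, 120]:  # first ACK=100, then 3 duplicates, then recovery
    on_ack(a)
print(state["retransmits"])  # [100]: segment at seq 100 was fast-retransmitted
```

A real sender would also enter fast recovery here (next section); the sketch only shows the trigger.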

TCP Reliable Transport: TCP Fast Retransmit

[sequence diagram: fast retransmit after sender receipt of triple duplicate ACK: Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; Seq=100 is lost; three duplicate ACK=100s arrive; Host A retransmits Seq=100, 20 bytes of data before the timeout expires.]

TCP Reliable Transport: TCP Roundtrip time and timeouts

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
– average several recent measurements, not just current SampleRTT

[plot: RTT (milliseconds) vs time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT]

EstimatedRTT = (1 − α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[plot: RTT (milliseconds) vs time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, sampleRTT]

TCP Reliable Transport: TCP Roundtrip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin"
– large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 − β) · DevRTT + β · |SampleRTT − EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(estimated RTT plus "safety margin")

TCP Reliable Transport: TCP Roundtrip time and timeouts
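The two estimator equations above combine into a small update routine. A minimal sketch, using the slide's constants (α = 0.125, β = 0.25) and milliseconds for all values; the starting values are assumed for illustration:

```python
# Sketch of TCP RTT estimation and timeout computation (RFC 6298 style).
ALPHA, BETA = 0.125, 0.25

def update(est_rtt, dev_rtt, sample_rtt):
    # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt
    # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    # TimeoutInterval = EstimatedRTT + 4 * DevRTT  (the "safety margin")
    return est_rtt, dev_rtt, est_rtt + 4 * dev_rtt

est, dev = 100.0, 10.0  # ms; assumed starting values
for sample in [120.0, 150.0, 90.0]:
    est, dev, timeout = update(est, dev, sample)
    print(f"sample={sample:6.1f}  est={est:6.1f}  timeout={timeout:6.1f}")
```

Note how a large |SampleRTT − EstimatedRTT| inflates DevRTT and therefore the timeout, exactly the "larger safety margin" behavior described above.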

TCP Reliable Transport

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no "message boundaries"
• pipelined:
– TCP congestion and flow control set window size
• full duplex data:
– bi-directional data flow in same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver

TCP Reliable Transport: Flow Control

[figure: receiver protocol stack: IP code -> TCP code -> TCP socket receiver buffers -> application process; segments arrive from sender]

• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)
• flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP Reliable Transport: Flow Control

[figure: receiver-side buffering: RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads flow in, data flows out to application process]

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
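The sender-side rule follows directly: usable window is the smaller of the congestion window and the advertised rwnd, minus what is already in flight. A minimal sketch with illustrative names (`sendable` and its parameters are assumptions, not slide code):

```python
# Sketch: a TCP sender limits in-flight data to min(cwnd, rwnd), so it
# overflows neither the network (congestion control) nor the receiver's
# advertised buffer (flow control).

def sendable(last_byte_sent, last_byte_acked, cwnd, rwnd):
    in_flight = last_byte_sent - last_byte_acked
    window = min(cwnd, rwnd)             # effective window
    return max(0, window - in_flight)    # bytes we may still transmit

# receiver advertises rwnd = 4096 (a common default RcvBuffer size)
print(sendable(last_byte_sent=9000, last_byte_acked=6000,
               cwnd=8000, rwnd=4096))    # min(8000, 4096) - 3000 = 1096
```

With rwnd small enough, the sender stalls even though cwnd would allow more data, which is exactly the flow-control guarantee above.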

Goals for Today
• Transport Layer
– Abstraction, services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem

Principles of Congestion Control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
– lost packets (buffer overflow at routers)
– long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control:
• end-end congestion control:
– no explicit feedback from network
– congestion inferred from end-system observed loss, delay
– approach taken by TCP
• network-assisted congestion control:
– routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at

TCP Congestion Control: TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness: Why is TCP Fair? AIMD: additive increase, multiplicative decrease

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[plot: TCP sender congestion window size (cwnd) vs time: AIMD saw tooth behavior, probing for bandwidth; additively increase window size until loss occurs, then cut window in half]
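The sawtooth comes straight out of the two AIMD rules. A minimal sketch (per-RTT granularity, congestion avoidance only; the loss pattern is invented to show one tooth):

```python
# Sketch of AIMD on cwnd: additive increase of one MSS per RTT,
# multiplicative decrease (halving) on loss.

MSS = 1500  # bytes

def on_rtt(cwnd, loss):
    if loss:
        return max(MSS, cwnd // 2)  # multiplicative decrease: cut cwnd in half
    return cwnd + MSS               # additive increase: +1 MSS per RTT

cwnd = 6 * MSS
trace = []
for loss in [False, False, True, False]:  # produces the sawtooth pattern
    cwnd = on_rtt(cwnd, loss)
    trace.append(cwnd)
print(trace)  # [10500, 12000, 6000, 7500]
```

Two senders running this rule against a shared bottleneck converge toward equal shares, which is the fairness argument in the next slides.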

TCP Congestion Control

• sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

[figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight window]

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes
• rate ≈ cwnd / RTT bytes/sec

TCP Congestion Control: TCP Fairness: Why is TCP Fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[plot: Connection 1 throughput vs Connection 2 throughput, both axes up to R, with the equal bandwidth share line; repeated congestion avoidance (additive increase) and loss (decrease window by factor of 2) steps move the pair toward the equal-share line]

TCP Congestion Control: TCP Fairness

Fairness and UDP:
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2

TCP Congestion Control: Slow Start

• when connection begins, increase rate exponentially until first loss event:
– initially cwnd = 1 MSS
– double cwnd every RTT
– done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[sequence diagram: Host A sends one segment, then two segments, then four segments; one RTT per round]
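The "increment per ACK implies doubling per RTT" step is worth making concrete. A minimal sketch under the usual idealization (every segment is ACKed, no loss; `slow_start_rounds` is an illustrative name):

```python
# Sketch of slow start: cwnd grows by one MSS per ACK; since each RTT
# delivers one ACK per in-flight segment, cwnd doubles every RTT.

MSS = 1

def slow_start_rounds(rounds):
    cwnd = 1 * MSS            # initially cwnd = 1 MSS
    sizes = []
    for _ in range(rounds):
        sizes.append(cwnd)
        acks = cwnd // MSS    # one ACK per segment sent this RTT
        cwnd += acks * MSS    # +1 MSS per ACK -> doubling per RTT
    return sizes

print(slow_start_rounds(4))  # [1, 2, 4, 8]: one, two, four, eight segments per RTT
```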

TCP Congestion Control: Detecting and Reacting to Loss

• loss indicated by timeout:
– cwnd set to 1 MSS
– window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
– dup ACKs indicate network capable of delivering some segments
– cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP Congestion Control

[FSM diagram: states slow start, congestion avoidance, fast recovery]
• start (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start
• slow start, new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; when cwnd > ssthresh, enter congestion avoidance
• congestion avoidance, new ACK: cwnd = cwnd + MSS × (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• slow start or congestion avoidance, duplicate ACK: dupACKcount++; when dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
• fast recovery, duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• fast recovery, new ACK: cwnd = ssthresh, dupACKcount = 0, enter congestion avoidance
• any state, timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
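The transitions above can be exercised with a small simulation. This is a simplified sketch, not a faithful Reno implementation: fast recovery is collapsed into the triple-dup-ACK transition (the "+3 MSS" inflation is omitted), and cwnd is tracked in MSS units:

```python
# Simplified sketch of the Reno-style FSM: slow start, congestion
# avoidance, and the two loss reactions (timeout vs 3 duplicate ACKs).

MSS = 1.0

class Reno:
    def __init__(self):
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * MSS
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth per RTT
            if self.cwnd > self.ssthresh:         # cwnd > ssthresh
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # ~ +1 MSS per RTT

    def on_triple_dup_ack(self):                  # Reno: halve, stay linear
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.ssthresh                 # (+3 MSS inflation omitted)
        self.state = "congestion_avoidance"

    def on_timeout(self):                         # back to slow start at 1 MSS
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.state = "slow_start"

r = Reno()
for _ in range(8):
    r.on_new_ack()
print(r.state, r.cwnd)   # slow_start 9.0
r.on_triple_dup_ack()
print(r.state, r.cwnd)   # congestion_avoidance 4.5
r.on_timeout()
print(r.state, r.cwnd)   # slow_start 1.0
```

Tahoe would differ only in `on_triple_dup_ack`, which would behave like `on_timeout`.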

TCP Throughput

• avg. TCP thruput as function of window size, RTT?
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg. window size (# in-flight bytes) is 3/4 W
– avg. thruput is 3/4 W per RTT

avg TCP thruput = (3/4) × W / RTT bytes/sec

[plot: window oscillates between W/2 and W]
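A quick numeric check of the formula, with an assumed example (W = 1 MB at loss, RTT = 100 ms; not from the slides):

```python
# Worked check of the average-throughput formula: the window oscillates
# between W/2 and W, so the average window is (3/4)W and
# avg throughput = (3/4) * W / RTT bytes/sec.

def avg_tcp_throughput(W_bytes, rtt_sec):
    return 0.75 * W_bytes / rtt_sec

print(avg_tcp_throughput(1_000_000, 0.100))  # 7500000.0 bytes/sec, i.e. 60 Mbps
```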

TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = 1.22 × MSS / (RTT × √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2×10⁻¹⁰: a very small loss rate!
• new versions of TCP for high-speed
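The slide's L = 2×10⁻¹⁰ can be recovered by solving the Mathis bound for L. A sketch, converting MSS to bits so throughput comes out in bits/sec (the unit handling is my assumption; the formula itself is from the slide):

```python
# Mathis et al. bound: throughput = 1.22 * MSS / (RTT * sqrt(L)).
# Solving for L shows how tiny the loss rate must be for 10 Gbps
# with 1500-byte segments and a 100 ms RTT.
import math

def mathis_throughput(mss_bytes, rtt_sec, loss_rate):
    return 1.22 * mss_bytes * 8 / (rtt_sec * math.sqrt(loss_rate))  # bits/sec

def required_loss_rate(target_bps, mss_bytes, rtt_sec):
    return (1.22 * mss_bytes * 8 / (target_bps * rtt_sec)) ** 2

L = required_loss_rate(10e9, 1500, 0.100)
print(f"L = {L:.2e}")                      # ~2e-10, matching the slide
print(mathis_throughput(1500, 0.100, L))   # recovers ~1e10 bits/sec
```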

Goals for Today
• Transport Layer
– Abstraction, services
– Multiplexing/Demultiplexing
– UDP: Connectionless Transport
– TCP: Reliable Transport
• Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
– Congestion control
• Data Center TCP
– Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
– SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[diagram: client connected via a switch to storage servers 1-4; a data block is striped into Server Request Units (SRUs) 1, 2, 3, 4; in a synchronized read the client requests SRUs 1-4 from all servers, and only after all responses arrive does the client send the next batch of requests]

Link idle time due to timeouts

[diagram: client, switch, servers 1-4; synchronized read of SRUs 1-4; server 4's response is lost, so the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast

[plot: goodput collapses as the number of servers increases]

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

TCP data-driven loss recovery

[sequence diagram: sender transmits Seq 1-5; packet 2 is lost; receiver repeats Ack 1; after 3 duplicate ACKs for 1 (packet 2 is probably lost), sender retransmits packet 2 immediately; receiver then sends Ack 5]

• In SANs, recovery in µsecs after loss


Page 24: Transport Layer and Data Center TCP

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport

number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)

SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)

ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP Reliable TransportConnection Management TCP 3-way handshake

client server each close their side of connection send TCP segment with FIN bit = 1

respond to received FIN with ACK on receiving FIN ACK can be combined with

own FINsimultaneous FIN exchanges can be

handled

TCP Reliable TransportConnection Management Closing connection

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:
  TCP throughput = (1.22 × MSS) / (RTT × √L)
• to achieve 10 Gbps throughput, need a loss rate of L = 2×10⁻¹⁰ (a very small loss rate!)
• ⇒ new versions of TCP for high-speed
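As a quick sanity check (my own arithmetic, not from the slides), both numbers on this slide follow by inverting the formulas; units are bits throughout:

```python
# Reproduce the "long, fat pipe" numbers: 1500-byte segments, 100 ms RTT,
# 10 Gbps target throughput.
from math import sqrt

MSS_bits = 1500 * 8      # segment size in bits
RTT = 0.100              # seconds
target = 10e9            # 10 Gbps in bits/sec

# window needed: throughput = W * MSS / RTT  =>  W = target * RTT / MSS
W = target * RTT / MSS_bits                    # ~83,333 in-flight segments

# Mathis formula: throughput = 1.22 * MSS / (RTT * sqrt(L)); solve for L
L = (1.22 * MSS_bits / (RTT * target)) ** 2    # ~2e-10 loss probability

# round-trip check: plugging L back in recovers the target rate
assert abs(1.22 * MSS_bits / (RTT * sqrt(L)) - target) < 1.0
```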

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport (Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts)
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer (SRU size is fixed)
• TCP used as the data transfer protocol

Cluster-based Storage Systems
[Figure: a client connects through a switch to storage servers 1–4; a synchronized read requests a data block, each server returning one Server Request Unit (SRU); once all four responses arrive, the client sends the next batch of requests]

Link idle time due to timeouts
[Figure: the same synchronized read, but server 4's SRU is dropped at the switch; the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Graph: goodput collapses as the number of servers increases]

TCP data-driven loss recovery
[Figure: sender transmits segments 1–5; segment 2 is lost; each later segment elicits Ack 1; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately, and the receiver responds with Ack 5. In SANs: recovery in µsecs after loss]

TCP timeout-driven loss recovery
[Figure: sender transmits segments 1–5; only segment 1 is ACKed; the sender waits out a full Retransmission Timeout (RTO) before retransmitting 1]
• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison
[Figure: side by side – data-driven recovery (3 dup ACKs of 1 trigger retransmit of 2, then Ack 5) vs. timeout-driven recovery (link idle until the RTO fires)]
• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (µs) in SANs

TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective
• principles behind transport layer services: multiplexing, demultiplexing; reliable data transfer; flow control; congestion control
• instantiation, implementation in the Internet: UDP, TCP
Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
  – But review chapter 4 from the book, Network Layer; we will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer Services/Protocols
  • Transport Layer Services/Protocols (2)
  • Transport Layer Services/Protocols (3)
  • Transport Layer Services/Protocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over "long, fat pipes"
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time

TCP Reliable Transport: Connection Management – Closing connection
• client, server each close their side of connection: send TCP segment with FIN bit = 1
• respond to received FIN with ACK; on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: closing connection – client and server start in ESTAB. Client calls clientSocket.close(), sends FINbit=1, seq=x, enters FIN_WAIT_1 (can no longer send, but can receive data). Server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close). Server sends FINbit=1, seq=y, enters LAST_ACK (can no longer send data). Client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED; server enters CLOSED on receiving the ACK]

TCP Reliable Transport
TCP: Transmission Control Protocol – RFCs 793, 1122, 1323, 2018, 2581
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport: sender events
• data rcvd from app: create segment with seq # (seq # is byte-stream number of first data byte in segment); start timer if not already running (think of timer as for oldest unacked segment); expiration interval: TimeOutInterval
• timeout: retransmit segment that caused timeout; restart timer
• ack rcvd: if ack acknowledges previously unacked segments, update what is known to be ACKed; start timer if there are still unacked segments
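The three sender events above can be sketched in Python. This is a simplified model with invented names (TcpSender), assuming a single timer for the oldest unacked segment and byte-counting sequence numbers:

```python
# Sketch of the TCP sender events: data-from-app, timeout, ACK received.
class TcpSender:
    def __init__(self):
        self.next_seq = 0         # byte-stream number of next new byte
        self.send_base = 0        # oldest unacked byte
        self.timer_running = False
        self.segments = {}        # seq -> payload, kept for retransmission

    def send(self, data: bytes):
        # create segment with seq # = first data byte; start timer if idle
        self.segments[self.next_seq] = data
        self.timer_running = True
        self.next_seq += len(data)

    def on_timeout(self):
        # retransmit the segment that caused the timeout (oldest unacked)
        self.timer_running = True  # restart timer
        return self.segments[self.send_base]

    def on_ack(self, acknum: int):
        if acknum > self.send_base:   # ACKs previously unacked segments
            for seq in [s for s, d in self.segments.items()
                        if s + len(d) <= acknum]:
                del self.segments[seq]
            self.send_base = acknum
        # timer runs only while unacked segments remain
        self.timer_running = bool(self.segments)
```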

TCP Reliable Transport: TCP Retransmission Scenarios
[Figure: lost ACK scenario – Host A sends Seq=92, 8 bytes of data (SendBase=92); Host B's ACK=100 is lost; A times out and resends Seq=92, 8 bytes; B re-ACKs 100 (SendBase=100)]
[Figure: premature timeout – Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 arrives only after A's timeout, so A resends Seq=92, 8 bytes; B's cumulative ACK=120 covers both (SendBase=120)]

TCP Reliable Transport: TCP Retransmission Scenarios
[Figure: cumulative ACK – Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; ACK=100 is lost, but cumulative ACK=120 arrives before the timeout, so A does not retransmit and next sends Seq=120, 15 bytes]

TCP ACK generation [RFC 1122, RFC 2581] – Reliable Transport
event at receiver → TCP receiver action:
• arrival of in-order segment with expected seq #, all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #, one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq # (gap detected) → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap
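The ACK-generation rules above reduce to a small decision function. A sketch with hypothetical names, leaving out the actual 500 ms delayed-ACK timer:

```python
# Decide the receiver's ACK action for an arriving segment, following the
# RFC 1122/2581 table: in-order vs. out-of-order, ACK pending or not.
def receiver_action(seg_seq: int, expected_seq: int, ack_pending: bool) -> str:
    if seg_seq == expected_seq:
        if ack_pending:
            return "send cumulative ACK now"   # covers both in-order segments
        return "delay ACK up to 500 ms"        # wait for a second segment
    if seg_seq > expected_seq:
        return "send duplicate ACK now"        # gap detected
    return "send ACK now"                      # fills (part of) a gap
```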

TCP Reliable Transport: TCP Fast Retransmit
• time-out period often relatively long: long delay before resending lost packet
• detect lost segments via duplicate ACKs: sender often sends many segments back-to-back; if a segment is lost, there will likely be many duplicate ACKs
• TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #; likely that unacked segment was lost, so don't wait for timeout

TCP Reliable Transport: TCP Fast Retransmit
[Figure: fast retransmit after sender receipt of triple duplicate ACK – Host A sends Seq=92, 8 bytes and Seq=100, 20 bytes; Seq=100 is lost; three further ACK=100s arrive, and A retransmits Seq=100, 20 bytes before its timeout fires]

TCP Reliable Transport: TCP Round-trip time and timeouts
Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
• SampleRTT will vary; want estimated RTT "smoother": average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip time and timeouts
EstimatedRTT = (1 − α) × EstimatedRTT + α × SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125
[Figure: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr – noisy SampleRTT fluctuating roughly 100–350 ms around a smoother EstimatedRTT]

TCP Reliable Transport: TCP Round-trip time and timeouts
• timeout interval: EstimatedRTT plus "safety margin" (large variation in EstimatedRTT → larger safety margin)
• estimate SampleRTT deviation from EstimatedRTT:
  DevRTT = (1 − β) × DevRTT + β × |SampleRTT − EstimatedRTT|  (typically, β = 0.25)
• TimeoutInterval = EstimatedRTT + 4 × DevRTT
  (estimated RTT plus the "safety margin")
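The estimator above is a few lines of arithmetic. A sketch using the slide's α = 0.125 and β = 0.25; the exact ordering of the DevRTT vs. EstimatedRTT updates is my assumption:

```python
# One RTT-estimator step: EWMA of SampleRTT plus a deviation-based
# safety margin, yielding the retransmission timeout.
def update(est_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
    est_rtt = (1 - alpha) * est_rtt + alpha * sample_rtt
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - est_rtt)
    timeout = est_rtt + 4 * dev_rtt       # TimeoutInterval
    return est_rtt, dev_rtt, timeout
```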


TCP Reliable Transport: Flow Control
flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack – from the sender, IP code delivers to TCP code, which fills the TCP socket receiver buffers; the application process may remove data from the TCP socket buffers slower than the TCP receiver is delivering (sender is sending)]

TCP Reliable Transport: Flow Control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering – RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the network and drain to the application process]
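The sender-side rule above ("limit unacked data to rwnd") can be sketched as a one-line check; the names are mine:

```python
# May the sender transmit `size` more bytes without risking overflow of
# the receiver's advertised window rwnd?
def can_send(last_byte_sent: int, last_byte_acked: int,
             rwnd: int, size: int) -> bool:
    in_flight = last_byte_sent - last_byte_acked   # unacked bytes
    return in_flight + size <= rwnd                # stay within advertised space
```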

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing / Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport (Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts)
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Congestion Control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control
two broad approaches towards congestion control:
• end-end congestion control
  – no explicit feedback from network
  – congestion inferred from end-system observed loss, delay
  – approach taken by TCP
• network-assisted congestion control
  – routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K
[Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness – Why is TCP Fair? AIMD: additive increase, multiplicative decrease
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior, probing for bandwidth – TCP sender congestion window size (cwnd) over time: additively increase window size ... until loss occurs (then cut window in half)]
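The AIMD saw tooth can be reproduced with a toy trace. Entirely my own sketch: cwnd is counted in MSS units, one step per RTT, and a loss is assumed whenever cwnd reaches an arbitrary capacity:

```python
# Toy AIMD trace: additive increase of 1 MSS per RTT, multiplicative
# decrease (halving) whenever cwnd hits the assumed capacity.
def aimd_trace(rtts: int, capacity: int = 16) -> list:
    cwnd, trace = 1, []
    for _ in range(rtts):
        trace.append(cwnd)
        cwnd = cwnd // 2 if cwnd >= capacity else cwnd + 1
    return trace   # ramps 1..16, drops to 8, ramps again: the saw tooth
```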

TCP Congestion Control
• sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes
  rate ≈ cwnd / RTT bytes/sec
[Figure: sender sequence number space – last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight region]

TCP Congestion Control: TCP Fairness – Why is TCP Fair?
two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally
[Figure: Connection 1 throughput vs. Connection 2 throughput, both axes up to R – repeated cycles of congestion avoidance (additive increase) and loss (decrease window by factor of 2) converge toward the equal bandwidth share line]

TCP Congestion Control: TCP Fairness
Fairness and UDP
• multimedia apps often do not use TCP: do not want rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss
Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts; web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

TCP Congestion Control: Slow Start
• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow, but ramps up exponentially fast
[Figure: Host A sends one segment to Host B, then two segments, then four segments, each batch one RTT apart]
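The per-ACK increment rule above is exactly what makes cwnd double each RTT. A toy sketch (cwnd in MSS units, assuming every segment sent in one RTT is ACKed in the next):

```python
# Slow-start growth: each of the cwnd in-flight segments returns one ACK,
# each ACK adds 1 MSS, so cwnd doubles every RTT.
def slow_start(rtts: int) -> list:
    cwnd, per_rtt = 1, []
    for _ in range(rtts):
        per_rtt.append(cwnd)
        cwnd += cwnd          # +1 MSS per ACK, one ACK per segment
    return per_rtt            # 1, 2, 4, 8, ...
```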


Page 26: Transport Layer and Data Center TCP

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1 wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but can receive data

clientSocketclose()

client state server state

ESTABESTAB

TCP Reliable TransportConnection Management Closing connection

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDP:
- multimedia apps often do not use TCP: they do not want their rate throttled by congestion control
- instead use UDP: send audio/video at a constant rate, tolerate packet loss

Fairness and parallel TCP connections:
- an application can open multiple parallel connections between two hosts
- web browsers do this
- e.g., a link of rate R with 9 existing connections: a new app asking for 1 TCP gets rate R/10; a new app asking for 11 TCPs gets ~R/2

TCP Congestion Control: TCP Fairness

when connection begins, increase rate exponentially until first loss event:
- initially cwnd = 1 MSS
- double cwnd every RTT
- done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[figure: Host A sends one segment, then two segments, then four segments, one round per RTT]

TCP Congestion Control: Slow Start
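The doubling can be sketched directly (illustrative, with cwnd counted in MSS units): each of the cwnd segments sent in a round is ACKed, each ACK adds 1 MSS, so cwnd doubles once per RTT.

```python
def slow_start(rtts):
    """Return cwnd (in MSS) at the start and after each of `rtts` rounds."""
    cwnd, trace = 1, [1]
    for _ in range(rtts):
        cwnd += cwnd        # one +1 MSS per ACK, cwnd ACKs per RTT -> doubling
        trace.append(cwnd)
    return trace

assert slow_start(4) == [1, 2, 4, 8, 16]   # exponential ramp-up
```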

TCP Congestion Control: Detecting and Reacting to Loss

loss indicated by timeout:
- cwnd set to 1 MSS
- window then grows exponentially (as in slow start) to a threshold, then grows linearly

loss indicated by 3 duplicate ACKs (TCP Reno):
- dup ACKs indicate the network is capable of delivering some segments
- cwnd is cut in half, window then grows linearly

TCP Tahoe always sets cwnd to 1 (on timeout or 3 duplicate ACKs)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

implementation: variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

TCP congestion control FSM (Reno):

initialization: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0

slow start:
- new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
- cwnd > ssthresh: enter congestion avoidance
- duplicate ACK: dupACKcount++
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
- new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
- duplicate ACK: dupACKcount++
- dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start

fast recovery:
- duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
- new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start

TCP Congestion Control
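The FSM above can be written down directly as a small class. This is a sketch in MSS units: retransmission itself is elided, and the name `RenoSender` is a choice made here, not from the lecture.

```python
MSS = 1  # work in MSS units

class RenoSender:
    """Toy Reno state machine: slow start / congestion avoidance / fast recovery."""
    def __init__(self):
        self.state = "slow_start"
        self.cwnd, self.ssthresh, self.dup = 1 * MSS, 64, 0

    def on_new_ack(self):
        self.dup = 0
        if self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * (MSS / self.cwnd)  # roughly +1 MSS per RTT
        else:                                     # fast recovery ends
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # window inflation
            return
        self.dup += 1
        if self.dup == 3:                         # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3         # (retransmit missing segment)
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh, self.cwnd, self.dup = self.cwnd / 2, 1 * MSS, 0
        self.state = "slow_start"                 # (retransmit missing segment)
```

Feeding it six new ACKs, then a triple duplicate ACK, then a timeout walks it through all three states.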

- avg TCP throughput as a function of window size and RTT (ignore slow start; assume there is always data to send)
- W: window size (measured in bytes) where loss occurs
  - avg window size (# in-flight bytes) is 3/4 W
  - avg throughput is 3/4 W per RTT

avg TCP throughput = (3/4) W / RTT bytes/sec

[figure: sawtooth oscillating between W/2 and W]

TCP Throughput

TCP over "long, fat pipes"

- example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability L [Mathis 1997]:

  TCP throughput = 1.22 * MSS / (RTT * sqrt(L))

- to achieve 10 Gbps throughput, need a loss rate of L = 2*10^-10, a very small loss rate
- hence new versions of TCP for high-speed networks
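The two numbers on this slide can be checked directly (MSS converted to bits; 1.22 is the constant from the [Mathis 1997] model):

```python
MSS_BITS, RTT, target = 1500 * 8, 0.100, 10e9   # segment size (bits), s, bits/s

# in-flight window needed to sustain the target: throughput = W * MSS / RTT
W = target * RTT / MSS_BITS
assert round(W) == 83333                        # segments in flight

# Mathis model: throughput = 1.22 * MSS / (RTT * sqrt(L)); solve for L
L = (1.22 * MSS_BITS / (RTT * target)) ** 2
assert 1.9e-10 < L < 2.3e-10                    # about 2e-10: a tiny loss rate
```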

Goals for Today
- Transport Layer
  - Abstraction / services
  - Multiplexing/Demultiplexing
  - UDP: Connectionless Transport
  - TCP: Reliable Transport
    - Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  - Congestion control
- Data Center TCP
  - Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
- test on an Ethernet-based storage cluster
- client performs synchronized reads
- increase # of servers involved in the transfer (SRU size is fixed)
- TCP used as the data transfer protocol

Cluster-based Storage Systems
[figure: a client connected through a switch to storage servers 1-4; a data block is striped across the servers as Server Request Units (SRUs); in a synchronized read the client requests SRUs 1-4, waits for all of them, and only then sends the next batch of requests]

Link idle time due to timeouts
[figure: the same synchronized read, but server 4's response is lost; the link is idle until server 4 experiences a timeout and retransmits its SRU]

TCP Throughput Collapse: Incast
- [Nagle04] called this Incast
- cause of throughput collapse: TCP timeouts
[figure: goodput vs. number of servers, showing the collapse]

TCP data-driven loss recovery
[figure: sender transmits segments 1-5; segment 2 is lost; the receiver ACKs 1 for each later segment; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately and the receiver responds with Ack 5]
- in SANs: recovery in microseconds after loss

TCP timeout-driven loss recovery
[figure: sender transmits segments 1-5; only segment 1 is ACKed; the sender must wait a full Retransmission Timeout (RTO) before retransmitting]
- timeouts are expensive (msecs to recover after loss)

TCP loss recovery comparison
[figures: side by side, data-driven recovery retransmits segment 2 immediately after three duplicate ACKs, while timeout-driven recovery sits idle for a full RTO]
- timeout-driven recovery is slow (ms)
- data-driven recovery is super fast (us) in SANs

TCP Throughput Collapse: Summary

- synchronized reads and TCP timeouts cause TCP throughput collapse
- previously tried options:
  - increase buffer size (costly)
  - reduce RTOmin (unsafe)
  - use Ethernet Flow Control (limited applicability)
- DCTCP (Data Center TCP): limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective
- principles behind transport-layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
- instantiation and implementation in the Internet: UDP, TCP

Next time: Network Layer
- leaving the network "edge" (application, transport layers)
- into the network "core"

Before Next time

- Project Proposal
  - due in one week
  - meet with groups, TA, and professor
- Lab 1
  - single-threaded TCP proxy
  - due in one week, next Friday
- No required reading and review due
  - but review chapter 4 from the book (Network Layer)
  - we will also briefly discuss data center topologies
- Check website for updated schedule


TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
- point-to-point: one sender, one receiver
- reliable, in-order byte stream: no "message boundaries"
- pipelined: TCP congestion and flow control set window size
- full-duplex data: bi-directional data flow in the same connection; MSS: maximum segment size
- connection-oriented: handshaking (exchange of control msgs) inits sender and receiver state before data exchange
- flow controlled: sender will not overwhelm receiver

TCP Reliable Transport

TCP sender events:

data rcvd from app:
- create segment with seq #; the seq # is the byte-stream number of the first data byte in the segment
- start timer if not already running (think of the timer as for the oldest unACKed segment); expiration interval: TimeOutInterval

timeout:
- retransmit the segment that caused the timeout
- restart timer

ack rcvd:
- if the ack acknowledges previously unACKed segments: update what is known to be ACKed; start timer if there are still unACKed segments

TCP Reliable Transport
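The three sender events can be sketched as a small event-driven class. This is an illustrative toy model: the name `TcpSenderSketch` and the dict of unACKed segments are choices made here, not kernel TCP.

```python
class TcpSenderSketch:
    """Toy model of the TCP sender events above (not real TCP)."""
    def __init__(self):
        self.next_seq = 0          # byte-stream number of the next new byte
        self.send_base = 0         # oldest unACKed byte
        self.unacked = {}          # seq# -> payload of in-flight segments
        self.timer_running = False

    def on_app_data(self, data):
        # seq# is the byte-stream number of the segment's first data byte
        self.unacked[self.next_seq] = data
        self.next_seq += len(data)
        self.timer_running = True  # start timer if not already running

    def on_timeout(self):
        # retransmit the segment that caused the timeout (oldest unACKed)
        oldest = min(self.unacked)
        self.timer_running = True  # restart timer
        return oldest, self.unacked[oldest]

    def on_ack(self, ack):
        # cumulative ACK: everything below `ack` is now known to be ACKed
        if ack > self.send_base:
            self.send_base = ack
            self.unacked = {s: d for s, d in self.unacked.items() if s >= ack}
            self.timer_running = bool(self.unacked)  # only if data still unACKed
```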

TCP Reliable Transport: Retransmission Scenarios

[figure: lost ACK scenario. Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost; A times out and retransmits Seq=92, 8 bytes of data; B re-sends ACK=100, and SendBase moves from 92 to 100]

[figure: premature timeout. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); A times out before ACK=100 arrives and retransmits Seq=92; B's cumulative ACK=120 covers both segments, and SendBase advances to 120]

[figure: cumulative ACK scenario. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but the cumulative ACK=120 arrives before the timeout, so A proceeds with Seq=120, 15 bytes of data and does not retransmit]

TCP Reliable Transport: Retransmission Scenarios

TCP ACK generation [RFC 1122, RFC 2581]: Reliable Transport

- event: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  action: delayed ACK; wait up to 500 ms for the next segment; if no next segment, send ACK
- event: arrival of in-order segment with expected seq #; one other segment has an ACK pending
  action: immediately send a single cumulative ACK, ACKing both in-order segments
- event: arrival of out-of-order segment with higher-than-expected seq #; gap detected
  action: immediately send a duplicate ACK, indicating the seq # of the next expected byte
- event: arrival of a segment that partially or completely fills the gap
  action: immediately send an ACK, provided the segment starts at the lower end of the gap

- time-out period often relatively long: long delay before resending a lost packet
- detect lost segments via duplicate ACKs:
  - sender often sends many segments back-to-back
  - if a segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if the sender receives 3 duplicate ACKs for the same data ("triple duplicate ACKs"), resend the unACKed segment with the smallest seq #
- it is likely that the unACKed segment was lost, so don't wait for the timeout

TCP Reliable Transport: TCP Fast Retransmit

[figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); Seq=100 is lost; four ACK=100s arrive, and A retransmits Seq=100, 20 bytes of data before the timeout fires]

TCP Reliable Transport: TCP Fast Retransmit
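The trigger condition can be sketched by scanning the ACK stream (illustrative; the function name is a choice made here): the first arrival of an ACK number is the original, and the third repeat of it is the triple duplicate that fires the retransmission.

```python
def acks_triggering_fast_retransmit(acks):
    """Return the ACK numbers at which a fast retransmit would fire."""
    fired, last, dups = [], None, 0
    for a in acks:
        if a == last:
            dups += 1
            if dups == 3:          # triple duplicate ACK
                fired.append(a)    # resend unACKed segment starting at seq a
        else:
            last, dups = a, 0      # a new cumulative ACK resets the count
    return fired

# ACK=100 arrives once, then three duplicates -> retransmit seq 100, once
assert acks_triggering_fast_retransmit([100, 100, 100, 100, 120]) == [100]
```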

Q: how to set the TCP timeout value?
- longer than RTT, but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss

Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt (ignore retransmissions)
- SampleRTT will vary; want the estimated RTT "smoother": average several recent measurements, not just the current SampleRTT

TCP Reliable Transport: TCP Round-trip time and timeouts

EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT
- exponential weighted moving average
- influence of a past sample decreases exponentially fast
- typical value: α = 0.125

[figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT fluctuates between roughly 100 and 350 ms over time, while EstimatedRTT tracks it smoothly]

TCP Reliable Transport: TCP Round-trip time and timeouts

- timeout interval: EstimatedRTT plus a "safety margin" (large variation in EstimatedRTT -> larger safety margin)
- estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|   (typically β = 0.25)

  TimeoutInterval = EstimatedRTT + 4 * DevRTT
                    (estimated RTT) + ("safety margin")

TCP Reliable Transport: TCP Round-trip time and timeouts
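The two EWMAs and the timeout computation fit in one function, using the typical constants α = 0.125 and β = 0.25 (the sample values fed in below are made up for illustration):

```python
def update_rtt(estimated, dev, sample, alpha=0.125, beta=0.25):
    """One RTT measurement step: smooth the estimate, track deviation."""
    estimated = (1 - alpha) * estimated + alpha * sample   # EWMA of SampleRTT
    dev = (1 - beta) * dev + beta * abs(sample - estimated)
    timeout = estimated + 4 * dev                          # estimate + margin
    return estimated, dev, timeout

est, dev = 100.0, 0.0                       # milliseconds
for sample in [110, 120, 90, 105]:          # measured SampleRTTs
    est, dev, timeout = update_rtt(est, dev, sample)
# the estimate stays near the samples, and the timeout sits above it
assert 95 < est < 115 and timeout > est
```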


flow control: receiver controls sender so the sender won't overflow the receiver's buffer by transmitting too much, too fast

[figure: receiver protocol stack. IP code delivers to TCP code, which fills the TCP socket receiver buffers that the application process reads from; the application may remove data from the socket buffers slower than the TCP receiver is delivering (i.e., slower than the sender is sending)]

TCP Reliable Transport: Flow Control

- receiver "advertises" free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits the amount of unACKed ("in-flight") data to the receiver's rwnd value
- this guarantees the receive buffer will not overflow

[figure: receiver-side buffering. RcvBuffer holds buffered data; the remaining free buffer space is rwnd; TCP segment payloads arrive from the sender, and data is read out to the application process]

TCP Reliable Transport: Flow Control
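Both sides of the contract can be sketched with the usual textbook variable names (an illustrative model; the function names are choices made here):

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    # free buffer space = RcvBuffer minus data buffered but not yet read
    return rcv_buffer - (last_byte_rcvd - last_byte_read)

def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    # keep unACKed ("in-flight") data within the receiver's advertised rwnd
    return (last_byte_sent - last_byte_acked) + nbytes <= rwnd

rwnd = advertised_rwnd(4096, last_byte_rcvd=2048, last_byte_read=0)
assert rwnd == 2048
assert sender_may_send(5000, 4000, rwnd, 1000)       # 2000 in flight: fits
assert not sender_may_send(5000, 4000, rwnd, 1500)   # would exceed rwnd
```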


  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 28: Transport Layer and Data Center TCP

data rcvd from appcreate segment with seq

seq is byte-stream

number of first data byte in segment

start timer if not already running think of timer as for

oldest unacked segment expiration interval

TimeOutInterval

timeoutretransmit segment that

caused timeoutrestart timer ack rcvdif ack acknowledges

previously unacked segments update what is known to

be ACKed start timer if there are

still unacked segments

TCP Reliable Transport

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

TCP Congestion Control

• sender limits transmission:  LastByteSent – LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion
• TCP sending rate, roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

rate ≈ cwnd / RTT  bytes/sec

[figure: sender sequence number space; last byte ACKed, then sent-but-not-yet-ACKed ("in-flight") bytes spanning cwnd, then last byte sent]

TCP Congestion Control: TCP Fairness (Why is TCP Fair?)

two competing sessions:
• additive increase gives a slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

[figure: Connection 1 throughput vs. Connection 2 throughput, both axes up to R; repeated congestion-avoidance (additive increase) steps and loss events (window decreased by factor of 2) move the operating point toward the equal-bandwidth-share line]
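The convergence argument in the figure can be checked with a toy simulation. A sketch under simplifying assumptions (synchronized losses, a hypothetical bottleneck capacity, rates in MSS per RTT): each multiplicative decrease halves the gap between the two flows, while additive increase preserves it, so the rates converge.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Two AIMD flows sharing one bottleneck: additive increase while the
    link has headroom, synchronized multiplicative decrease when it is full."""
    for _ in range(rounds):
        if x + y > capacity:        # overload: both flows see loss, both halve
            x, y = x / 2.0, y / 2.0
        else:                       # headroom: both flows add 1 per RTT
            x, y = x + 1.0, y + 1.0
    return x, y

# start far from fair (1 vs 30); after many rounds the rates are ~equal
x, y = aimd_two_flows(x=1.0, y=30.0, capacity=40.0, rounds=500)
```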

TCP Congestion Control: TCP Fairness

Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
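The R/10 vs. R/2 claim is just per-connection fair sharing applied twice. A back-of-envelope sketch, assuming the bottleneck splits perfectly evenly across connections (an idealization):

```python
def new_app_share(R, existing, new_conns):
    """Bandwidth a new app gets if R is split evenly across all
    connections and the app owns new_conns of them."""
    return new_conns * R / (existing + new_conns)

one_conn    = new_app_share(R=1.0, existing=9, new_conns=1)    # R/10
eleven_conn = new_app_share(R=1.0, existing=9, new_conns=11)   # 11R/20, about R/2
```

This is why opening parallel connections is an easy (if unfair) way to grab bandwidth.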

TCP Congestion Control: Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow, but ramps up exponentially fast

[figure: Host A sends one segment, then two segments, then four segments to Host B, one RTT apart]
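The "+1 MSS per ACK implies doubling per RTT" step deserves spelling out: each round delivers one ACK per segment sent, so cwnd grows by its own size every RTT. A sketch in units of MSS:

```python
def slow_start(rounds):
    """cwnd (in MSS) after each RTT round of slow start."""
    cwnd = 1
    history = [cwnd]
    for _ in range(rounds):
        acks = cwnd          # one ACK arrives per segment sent this round
        cwnd += acks         # +1 MSS per ACK, so cwnd doubles each RTT
        history.append(cwnd)
    return history

history = slow_start(4)      # [1, 2, 4, 8, 16]
```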

TCP Congestion Control: Detecting and Reacting to Loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno):
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before the loss event

TCP Congestion Control

[diagram: TCP Reno congestion-control FSM, with states slow start, congestion avoidance, and fast recovery]

initialization (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): → congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
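The Reno FSM transcribes almost mechanically into code. A sketch with cwnd and ssthresh in units of MSS (ssthresh initialized to 64 as a stand-in for the slide's 64 KB) and the actual (re)transmission actions elided:

```python
class RenoFSM:
    """States and transitions of the TCP Reno congestion-control FSM."""
    def __init__(self):
        self.state = "slow_start"
        self.cwnd, self.ssthresh, self.dup = 1.0, 64.0, 0

    def new_ack(self):
        if self.state == "slow_start":
            self.cwnd += 1.0                      # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += 1.0 / self.cwnd          # about +1 MSS per RTT
        else:                                     # fast_recovery: deflate
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup = 0

    def dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0                      # window inflation
            return
        self.dup += 1
        if self.dup == 3:                         # triple duplicate ACK
            self.ssthresh = self.cwnd / 2.0
            self.cwnd = self.ssthresh + 3.0       # (would retransmit here)
            self.state = "fast_recovery"

    def timeout(self):                            # same from any state
        self.ssthresh = self.cwnd / 2.0
        self.cwnd, self.dup = 1.0, 0
        self.state = "slow_start"                 # (would retransmit here)
```

Driving it with a sequence of ACK events reproduces the slide's transitions, e.g. three duplicate ACKs move it into fast recovery with cwnd = ssthresh + 3.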

TCP Throughput

• avg TCP throughput as function of window size, RTT?
  – ignore slow start, assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT

avg TCP throughput = (3/4) · W / RTT  bytes/sec

[figure: window size sawtooth oscillating between W/2 and W]
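The 3/4 factor is just the midpoint of a window that saws between W/2 and W. A quick sketch with illustrative numbers (W = 100 KB, RTT = 100 ms are made-up inputs, not from the slide):

```python
def avg_tcp_throughput(W, rtt):
    """Average throughput in bytes/sec when the window oscillates
    between W/2 and W: average window is 3/4 W per RTT."""
    return 0.75 * W / rtt

# e.g. loss at W = 100 KB with a 100 ms RTT: about 750 KB/s on average
bps = avg_tcp_throughput(W=100_000, rtt=0.100)
```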

TCP over "long, fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = 1.22 · MSS / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP needed for high-speed operation
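The slide's two numbers are worth verifying against the Mathis formula. A sketch plugging in the stated parameters (MSS converted to bits so the result comes out in bits/sec):

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Mathis et al. bound: throughput = 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))

# slide's numbers: 1500-byte segments, 100 ms RTT, L = 2e-10
bps = mathis_throughput_bps(1500, 0.100, 2e-10)   # comes out near 10 Gbps

# window needed to keep a 10 Gbps pipe full at 100 ms RTT, in segments
W = 10e9 * 0.100 / (1500 * 8)                     # about 83,333 segments
```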

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase the number of servers involved in the transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[figure: a client connected through a switch to storage servers 1–4; a data block is striped across the servers in Server Request Units (SRUs); in a synchronized read the client requests SRUs 1–4 in parallel, and only after all four responses arrive does the client send the next batch of requests]

Link idle time due to timeouts

[figure: same client/switch/servers setup performing a synchronized read; responses from servers 1–3 arrive, but server 4's response is lost, so the client's link is idle until server 4 experiences a timeout and retransmits]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[figure: goodput vs. number of servers, showing the collapse]
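To see why a single timeout is so damaging, compare the block transfer time with the idle time an RTO introduces. A back-of-envelope sketch with hypothetical numbers (1 Gbps link, 1 MB block, and the common 200 ms RTO_min; none of these are from the paper's setup):

```python
def goodput_bps(link_bps, block_bytes, n_timeouts, rto_s=0.200):
    """Effective goodput when each block transfer stalls for n_timeouts RTOs."""
    transfer_s = block_bytes * 8 / link_bps
    return block_bytes * 8 / (transfer_s + n_timeouts * rto_s)

ideal   = goodput_bps(1e9, 1_000_000, n_timeouts=0)   # the full link rate
stalled = goodput_bps(1e9, 1_000_000, n_timeouts=1)   # under 5% of it
```

An 8 ms transfer followed by a 200 ms stall wastes over 95% of the link, which is the collapse in the figure.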

TCP data-driven loss recovery

[figure: sender transmits segments 1–5 to receiver; segment 2 is lost; the receiver keeps ACKing 1, and after 3 duplicate ACKs for 1 (packet 2 is probably lost) the sender retransmits packet 2 immediately, after which the receiver ACKs 5]

In SANs, data-driven recovery completes within microseconds after a loss.

TCP timeout-driven loss recovery

[figure: sender transmits segments 1–5; only segment 1 is ACKed, and the sender waits out a full Retransmission Timeout (RTO) before retransmitting]

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

[figure, side by side: data-driven recovery, where 3 duplicate ACKs trigger an immediate retransmit of segment 2, vs. timeout-driven recovery, where the sender idles for a full RTO]

• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (µs) in SANs
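The ms-vs-µs contrast can be made concrete. A sketch with assumed round numbers (a 200 ms RTO_min and a 100 µs SAN round trip; both are illustrative, not measured values from the slides):

```python
rto_s   = 0.200      # a typical RTO_min used by many TCP stacks (assumption)
rtt_san = 100e-6     # ~100 microsecond round trip in a SAN (assumption)

# data-driven recovery finishes in about one RTT; a timeout stalls for a full RTO
slowdown = rto_s / rtt_san    # roughly three orders of magnitude
```

With these numbers, waiting out one RTO costs as much time as ~2000 successful SAN round trips.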

TCP Throughput Collapse: Summary

• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective

• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 29: Transport Layer and Data Center TCP

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtim

eout

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

tim

eout

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP Reliable TransportTCP Retransmission Scenerios

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 30: Transport Layer and Data Center TCP

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

tim

eout

Seq=100 20 bytes of data

ACK=120

TCP Reliable TransportTCP Retransmission Scenerios

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion Control: TCP Fairness

Fairness and UDP:
• multimedia apps often do not use TCP: they do not want their rate throttled by congestion control
• instead they use UDP: send audio/video at a constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• an application can open multiple parallel connections between two hosts; web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
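The R/10 vs. R/2 arithmetic in the example can be checked directly (a sketch that assumes the bottleneck is split equally per connection, which is the slide's idealization):

```python
def app_share(link_rate, existing, parallel):
    # with (existing + parallel) connections each getting an equal share,
    # an app opening `parallel` connections captures that many shares
    return link_rate * parallel / (existing + parallel)

R = 1.0  # normalized bottleneck rate
print(app_share(R, existing=9, parallel=1))   # 0.1  -> R/10
print(app_share(R, existing=9, parallel=11))  # 0.55 -> about R/2 of the link
```

Per-connection fairness is thus easy to game: opening more connections buys a bigger aggregate share.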

TCP Congestion Control: Slow Start

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow, but ramps up exponentially fast

[Figure: time diagram between Host A and Host B: one segment in the first RTT, then two segments, then four]
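Slow start's per-ACK increment doubles the window each round trip; a minimal sketch of the window growth (cwnd in MSS units):

```python
def slow_start_windows(rtts, mss=1):
    # cwnd grows by 1 MSS per ACK; with cwnd/MSS ACKs arriving per RTT,
    # the window doubles every round trip
    cwnd, windows = mss, []
    for _ in range(rtts):
        windows.append(cwnd)
        cwnd *= 2
    return windows

print(slow_start_windows(5))  # [1, 2, 4, 8, 16]
```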

TCP Congestion Control: Detecting and Reacting to Loss

• loss indicated by timeout: cwnd set to 1 MSS; window then grows exponentially (as in slow start) to a threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno): dup ACKs indicate the network is capable of delivering some segments; cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

Switching from Slow Start to Congestion Avoidance (CA):
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout
Implementation: variable ssthresh; on a loss event, ssthresh is set to 1/2 of cwnd just before the loss event

[FSM: TCP Reno congestion control, with states slow start, congestion avoidance, and fast recovery]

slow start (entry: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
• cwnd ≥ ssthresh: enter congestion avoidance
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0, enter congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
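The state machine above can be transcribed almost mechanically into code; this is a sketch of the Reno transitions only (cwnd in MSS units, and the actual retransmissions and timers are omitted):

```python
class RenoSketch:
    """Sketch of the slow-start / congestion-avoidance / fast-recovery FSM."""

    def __init__(self, ssthresh=64):
        self.state = "slow_start"
        self.cwnd = 1              # in MSS units
        self.ssthresh = ssthresh
        self.dup = 0

    def on_new_ack(self):
        self.dup = 0
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh      # deflate window, leave recovery
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1                 # +1 MSS per ACK -> doubles per RTT
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1 / self.cwnd     # roughly +1 MSS per RTT overall

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1                 # window inflation per extra dup ACK
            return
        self.dup += 1
        if self.dup == 3:                  # triple duplicate ACK
            self.ssthresh = max(int(self.cwnd) // 2, 2)
            self.cwnd = self.ssthresh + 3  # (retransmit missing segment here)
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = max(int(self.cwnd) // 2, 2)
        self.cwnd, self.dup, self.state = 1, 0, "slow_start"
```

Driving it with seven new ACKs, three duplicates, and a recovery ACK walks through all three states exactly as the diagram prescribes.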

TCP Throughput

• avg TCP throughput as a function of window size and RTT (ignore slow start, assume always data to send)
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT:

avg TCP thruput = (3/4) · W / RTT bytes/sec

[Figure: sawtooth of the window oscillating between W/2 and W]
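Since the sawtooth oscillates between W/2 and W, the mean window is 3/4 W; a quick check of the formula (function name is my own):

```python
def avg_tcp_throughput(W_bytes, rtt_seconds):
    # window oscillates between W/2 and W -> mean window is 3/4 W
    return 0.75 * W_bytes / rtt_seconds   # bytes/sec

# e.g. loss at W = 100 kB with a 100 ms RTT averages 750 kB/s
print(avg_tcp_throughput(100_000, 0.100))
```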

TCP over "long, fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

TCP throughput = (1.22 · MSS) / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP are needed for high-speed links
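Plugging the slide's numbers into the Mathis et al. formula reproduces the quoted loss rate (throughput and MSS are in bytes here; the helper names are my own):

```python
from math import sqrt

def mathis_throughput(mss, rtt, loss):
    # TCP throughput ~ 1.22 * MSS / (RTT * sqrt(L)), bytes/sec
    return 1.22 * mss / (rtt * sqrt(loss))

def loss_for_throughput(mss, rtt, target):
    # invert the formula for the segment loss probability L
    return (1.22 * mss / (rtt * target)) ** 2

# 1500-byte segments, 100 ms RTT, 10 Gb/s target (= 1.25e9 bytes/s):
L = loss_for_throughput(mss=1500, rtt=0.100, target=10e9 / 8)
print(L)  # roughly 2e-10, matching the slide
```

One lost segment in every five billion is far below what real links deliver, which is the argument for high-speed TCP variants.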

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in the transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a client connects through a switch to storage servers 1–4; a data block is striped across the servers in Server Request Units (SRUs); in a synchronized read the client requests SRUs 1–4 in parallel, and only after all four arrive does it send the next batch of requests]

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers increases]

TCP data-driven loss recovery

[Figure: sender transmits segments 1–5 and segment 2 is lost; each later arrival triggers another Ack 1; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately and then receives Ack 5]

In SANs: recovery in microseconds after loss.

TCP timeout-driven loss recovery

[Figure: sender transmits segments 1–5 but only Ack 1 returns; the sender stalls for a full Retransmission Timeout (RTO) before retransmitting]

• Timeouts are expensive (msecs to recover after loss)

TCP loss recovery comparison

[Figure: the two timelines side by side: data-driven recovery retransmits segment 2 immediately on the third duplicate Ack 1 and quickly receives Ack 5, while timeout-driven recovery waits out the entire Retransmission Timeout (RTO)]

Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs.
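The asymmetry in this comparison is stark in a SAN, where the round-trip time is microseconds but the retransmission timer is milliseconds (the numbers below are illustrative defaults, not measurements):

```python
def recovery_delay_us(rtt_us, rto_us, got_triple_dup_acks):
    # data-driven recovery costs about one extra RTT;
    # a timeout stalls the connection for the whole RTO
    return rtt_us if got_triple_dup_acks else rto_us

SAN_RTT_US, RTO_US = 100, 200_000   # ~100 us SAN RTT vs. a 200 ms minimum RTO
print(recovery_delay_us(SAN_RTT_US, RTO_US, True))    # 100
print(recovery_delay_us(SAN_RTT_US, RTO_US, False))   # 200000
```

A factor of roughly 2000× between the two recovery paths is why a handful of timeouts can idle an otherwise saturated link.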

TCP Throughput Collapse: Summary

• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP): limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
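A back-of-the-envelope model makes the collapse mechanism concrete. This is entirely illustrative: the SRU size, link rate, RTO, and the per-read timeout probability `p_timeout` are assumptions, not measurements from the paper:

```python
def goodput_mbps(n_servers, sru_bytes=256_000, link_mbps=1000,
                 rto_s=0.200, p_timeout=0.0):
    block_mbit = n_servers * sru_bytes * 8 / 1e6     # one synchronized read
    transfer_s = block_mbit / link_mbps              # ideal transfer time
    # each read that suffers a timeout idles the link for one RTO,
    # because the client cannot issue the next batch until all SRUs arrive
    return block_mbit / (transfer_s + p_timeout * rto_s)

print(goodput_mbps(4, p_timeout=0.0))   # full link rate: 1000.0 Mb/s
print(goodput_mbps(4, p_timeout=0.5))   # collapses to well under 100 Mb/s
```

Because the RTO dwarfs the block transfer time, even a modest timeout probability dominates the denominator, which is the intuition behind both the RTOmin workaround and DCTCP's queue-length control.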

Perspective

principles behind transport layer services:
• multiplexing, demultiplexing
• reliable data transfer
• flow control
• congestion control

instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
  – But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule


event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACK indicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP ACK generation [RFC 1122 2581]Reliable Transport

time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule


time-out period often relatively long long delay before

resending lost packetdetect lost segments via

duplicate ACKs sender often sends many

segments back-to-back if segment is lost there

will likely be many duplicate ACKs

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo) resend unacked segment with smallest seq

likely that unacked segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

TCP Reliable TransportTCP Fast Retransmit

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

tim

eout

ACK=100

ACK=100

ACK=100

Seq=100 20 bytes of data

Seq=100 20 bytes of data

TCP Reliable TransportTCP Fast Retransmit

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

[FSM: TCP congestion control — slow start, congestion avoidance, fast recovery]
• slow start:
  – init: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
  – new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
  – duplicate ACK: dupACKcount++
  – when cwnd > ssthresh: enter congestion avoidance
• congestion avoidance:
  – new ACK: cwnd = cwnd + MSS × (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
  – duplicate ACK: dupACKcount++
• from either state, dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
• fast recovery:
  – duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
  – new ACK: cwnd = ssthresh, dupACKcount = 0, return to congestion avoidance
• timeout (any state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start

TCP Congestion Control
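The loss transitions in the diagram can be condensed into a small sketch (hypothetical helper names; cwnd and ssthresh in MSS units, and fast recovery's temporary cwnd = ssthresh + 3 inflation is omitted for brevity):

```python
def on_timeout(cwnd, ssthresh):
    """Timeout (Tahoe and Reno): halve the threshold, restart slow start."""
    return 1, max(cwnd // 2, 1)          # (new cwnd, new ssthresh)

def on_triple_dupack(cwnd, ssthresh, reno=True):
    """Three duplicate ACKs: Reno halves cwnd (fast recovery),
    Tahoe falls all the way back to slow start."""
    ssthresh = max(cwnd // 2, 1)
    return (ssthresh if reno else 1), ssthresh
```

For example, a flow with cwnd of 16 MSS keeps sending at 8 MSS after a triple-dup-ACK under Reno, but restarts from 1 MSS under Tahoe.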

• avg TCP throughput as function of window size, RTT
  – ignore slow start, assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is ¾ W
  – avg throughput is ¾ W per RTT:

    avg TCP throughput = (3/4) × W / RTT bytes/sec

[Figure: sawtooth oscillating between W/2 and W]

TCP Throughput

TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 × MSS / (RTT × √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2×10⁻¹⁰, a very small loss rate!
• new versions of TCP are needed for high-speed links
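The slide's numbers can be checked against the Mathis formula directly (a sketch; MSS in bytes, RTT in seconds, rates in bits/sec):

```python
from math import sqrt

def mathis_throughput_bps(mss, rtt, loss):
    """Mathis et al. 1997: throughput = 1.22 * MSS / (RTT * sqrt(L))."""
    return 8 * 1.22 * mss / (rtt * sqrt(loss))

def window_segments(target_bps, rtt, mss):
    """In-flight segments needed to sustain target_bps over one RTT."""
    return target_bps * rtt / (8 * mss)

def loss_rate_for(target_bps, rtt, mss):
    """Invert the Mathis formula for the tolerable loss probability L."""
    return (8 * 1.22 * mss / (rtt * target_bps)) ** 2

w = window_segments(10e9, 0.100, 1500)   # ~83,333 segments in flight
L = loss_rate_for(10e9, 0.100, 1500)     # ~2e-10 tolerable loss rate
```

Both of the slide's figures (W = 83,333 and L = 2×10⁻¹⁰) fall out of the same 1500-byte / 100 ms / 10 Gbps inputs.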

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems
[Figure: a client connects through a switch to storage servers 1–4; a data block is striped into Server Request Units (SRUs), one per server; the client issues a synchronized read for SRUs 1–4 and sends the next batch of requests only after all four responses arrive]

Link idle time due to timeouts
[Figure: the same synchronized read, but server 4's response is lost; the link is idle until server 4 experiences a timeout]

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts
[Figure: goodput collapses as the number of servers involved in the transfer grows]

TCP data-driven loss recovery
[Figure: sender transmits segments 1–5; segment 2 is lost; the receiver ACKs 1 four times; after 3 duplicate ACKs for 1 (packet 2 is probably lost), the sender retransmits packet 2 immediately and then receives Ack 5]
• In SANs, data-driven recovery completes in microseconds after a loss

TCP timeout-driven loss recovery
[Figure: sender transmits segments 1–5; no duplicate ACKs arrive, so the sender must wait a full Retransmission Timeout (RTO) before retransmitting segment 1 and receiving Ack 1]
• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison
[Figure: side by side — data-driven recovery retransmits segment 2 right after the third duplicate ACK; timeout-driven recovery sits idle for a full Retransmission Timeout (RTO) before retransmitting]
• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (µs) in SANs

TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limit in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
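Why a single timeout collapses goodput is visible in back-of-the-envelope arithmetic (hypothetical numbers: a 1 MB block that takes ~8 ms to transfer at 1 Gbps, and a 200 ms RTO stalling the link):

```python
def goodput_bps(block_bytes, transfer_s, idle_s):
    """Effective goodput when each synchronized read can be followed by
    link-idle time waiting for one straggler's retransmission timeout."""
    return 8 * block_bytes / (transfer_s + idle_s)

no_loss = goodput_bps(1_000_000, 0.008, 0.0)      # link fully utilized, ~1 Gbps
one_rto = goodput_bps(1_000_000, 0.008, 0.200)    # one RTO per block: ~38 Mbps
```

One 200 ms timeout per synchronized read cuts effective goodput by more than an order of magnitude, which is the collapse in the Incast curve.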

Perspective
• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 33: Transport Layer and Data Center TCP

[Figure: Host A sends "Seq=92, 8 bytes of data" and "Seq=100, 20 bytes of data"; the second segment is lost (X); Host B sends ACK=100 repeatedly; on the triple duplicate ACK, Host A retransmits "Seq=100, 20 bytes of data" before its timeout fires]
• fast retransmit: after sender receipt of triple duplicate ACK, retransmit the missing segment immediately, without waiting for the timeout

TCP Reliable Transport: TCP Fast Retransmit
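The trigger can be sketched as a duplicate-ACK counter (a simplification; a real sender also tracks which bytes each cumulative ACK covers):

```python
def fast_retransmit_seq(acks):
    """Scan a stream of cumulative ACK numbers; return the sequence
    number to retransmit when the third duplicate ACK arrives."""
    last, dups = None, 0
    for ack in acks:
        if ack == last:
            dups += 1
            if dups == 3:           # triple duplicate ACK
                return ack          # retransmit the segment starting here
        else:
            last, dups = ack, 0
    return None                     # no fast retransmit triggered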

Q: how to set TCP timeout value?
• longer than RTT, but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary; want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP Reliable Transport: TCP Round-trip time and timeouts

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis time (seconds) 1–106, y-axis RTT (milliseconds) 100–350; SampleRTT fluctuates widely while EstimatedRTT tracks it smoothly]

EstimatedRTT = (1 − α) × EstimatedRTT + α × SampleRTT
• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT vs. EstimatedRTT over time (seconds), RTT in milliseconds]

TCP Reliable Transport: TCP Round-trip time and timeouts

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

    DevRTT = (1 − β) × DevRTT + β × |SampleRTT − EstimatedRTT|   (typically, β = 0.25)

    TimeoutInterval = EstimatedRTT + 4 × DevRTT
    (estimated RTT plus "safety margin")

TCP Reliable Transport: TCP Round-trip time and timeouts
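These two EWMAs are easy to state in code; one update per SampleRTT, with the gains α = 0.125 and β = 0.25 matching the slide (and RFC 6298):

```python
ALPHA, BETA = 0.125, 0.25   # standard gains for the two EWMAs

def update_timeout(est_rtt, dev_rtt, sample_rtt):
    """One measurement step; returns (EstimatedRTT, DevRTT, TimeoutInterval).
    DevRTT is updated using the previous EstimatedRTT, as in RFC 6298."""
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - est_rtt)
    est_rtt = (1 - ALPHA) * est_rtt + ALPHA * sample_rtt
    return est_rtt, dev_rtt, est_rtt + 4 * dev_rtt
```

For example, with EstimatedRTT 100 ms, DevRTT 10 ms, and a new 120 ms sample, the estimate moves to 102.5 ms, the deviation to 12.5 ms, and the timeout to 152.5 ms.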

TCP: Transmission Control Protocol (RFCs 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• full duplex data: bi-directional data flow in the same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver

TCP Reliable Transport

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)
[Figure: receiver protocol stack — IP code passes segments from the sender up to TCP code, which queues them in the TCP socket receiver buffers until the application process reads them]

TCP Reliable Transport: Flow Control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
[Figure: receiver-side buffering — RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads arrive from the sender and drain to the application process]

TCP Reliable Transport: Flow Control
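The sender-side rule reduces to one inequality; a minimal sketch (hypothetical helper, byte offsets as plain integers):

```python
def can_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """Flow-control gate: sending nbytes more must keep the amount of
    unacked in-flight data within the receiver's advertised window."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd
```

With 400 bytes in flight and rwnd = 500, a 100-byte segment may be sent, but a 200-byte one must wait for ACKs (or a larger advertised window).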

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Congestion Control
congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

two broad approaches towards congestion control:
• end-end congestion control
  – no explicit feedback from network
  – congestion inferred from end-system observed loss, delay
  – approach taken by TCP
• network-assisted congestion control
  – routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – or explicit rate for sender to send at

Principles of Congestion Control

fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness

approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth — cwnd (TCP sender congestion window size) additively increases over time until loss occurs, then is cut in half]

TCP Congestion Control: TCP Fairness. Why is TCP fair? AIMD: additive increase, multiplicative decrease

sender limits transmission:  LastByteSent − LastByteAcked ≤ cwnd
• cwnd is dynamic, a function of perceived network congestion
[Figure: sender sequence number space — last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight bytes]

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

TCP Congestion Control

Why is TCP fair? two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, both bounded by R; repeated congestion avoidance (additive increase) and loss (window decreased by factor of 2) moves the operating point toward the equal bandwidth share line]

TCP Congestion Control: TCP Fairness. Why is TCP Fair?
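The chart's convergence argument can be checked numerically with a two-flow sketch (synchronized loss is assumed: both flows halve whenever their combined rate exceeds the bottleneck capacity):

```python
def aimd_two_flows(x, y, capacity, steps):
    """Each RTT both flows add 1; when their sum exceeds capacity,
    both halve. Additive increase preserves the gap x - y, while each
    multiplicative decrease halves it, pulling the flows toward the
    equal-share line."""
    for _ in range(steps):
        if x + y > capacity:
            x, y = x / 2, y / 2     # multiplicative decrease on loss
        else:
            x, y = x + 1, y + 1     # additive increase
    return x, y

x, y = aimd_two_flows(80.0, 10.0, 100.0, 200)   # very unequal start
```

After 200 rounds the two rates are within a small fraction of each other, even from an 8:1 starting imbalance.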

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 34: Transport Layer and Data Center TCP

Q how to set TCP timeout value

longer than RTT but RTT varies

too short premature timeout unnecessary retransmissions

too long slow reaction to segment loss

Q how to estimate RTT

bull SampleRTT measured time from segment transmission until ACK receipt

ndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquo

ndash average several recent measurements not just current SampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106

time (seconnds)

RTT

(mill

isec

onds

)

SampleRTT Estimated RTT

EstimatedRTT = (1- )EstimatedRTT + SampleRTT exponential weighted moving average influence of past sample decreases

exponentially fast typical value = 0125

RTT (

mill

iseco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTT

TCP Reliable TransportTCP Roundtrip time and timeouts

time (seconds)

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

• Project Proposal
  – Due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 36: Transport Layer and Data Center TCP

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT -gt larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT DevRTT = (1-)DevRTT + |SampleRTT-EstimatedRTT|

(typically = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP Reliable TransportTCP Roundtrip time and timeouts

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 37: Transport Layer and Data Center TCP

full duplex data bi-directional data flow in

same connection MSS maximum segment

sizeconnection-oriented

handshaking (exchange of control msgs) inits sender receiver state before data exchange

flow controlled sender will not overwhelm

receiver

bull point-to-pointndash one sender one

receiver bull reliable in-order

byte steamndash no ldquomessage

boundariesrdquobull pipelined

ndash TCP congestion and flow control set window size

TCP Transmission Control ProtocolRFCs 79311221323 2018 2581

TCP Reliable Transport

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

application

OS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP Reliable TransportFlow Control

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd

cwnd is a dynamic function of perceived network congestion

TCP sending rate: roughly, send cwnd bytes, wait RTT for ACKs, then send more bytes:

rate ≈ cwnd / RTT  bytes/sec

[figure: sender sequence number space, showing last byte ACKed; sent, not-yet-ACKed ("in-flight"); last byte sent; cwnd spanning the in-flight region]

TCP Congestion Control

two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

[figure: Connection 1 throughput vs. Connection 2 throughput, both axes from 0 to R; repeated congestion-avoidance additive-increase steps followed by loss (window decreased by factor of 2) move the operating point toward the equal bandwidth share line]

TCP Congestion Control: TCP Fairness. Why is TCP fair?
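The convergence argument above can be checked numerically. A minimal sketch, assuming synchronized losses and equal RTTs (the idealization the phase plot makes):

```python
def aimd_share(x, y, link, steps):
    """Two synchronized AIMD flows sharing one bottleneck of capacity `link`."""
    for _ in range(steps):
        if x + y > link:          # loss: both flows halve (multiplicative decrease)
            x, y = x / 2, y / 2
        else:                     # both flows additively increase
            x, y = x + 1, y + 1
    return x, y

x, y = aimd_share(5, 60, 100, 1000)
# the initially unequal flows end up with (nearly) equal bandwidth shares,
# because each halving also halves the gap between them
```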

Fairness and UDP:
• multimedia apps often do not use TCP: they do not want their rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

TCP Congestion Control: TCP Fairness

when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received

summary: initial rate is slow but ramps up exponentially fast

[figure: Host A sends one segment, then two, then four; each exchange with Host B takes one RTT]

TCP Congestion Control: Slow Start
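Doubling per RTT means the ramp-up is logarithmic in the target window. A small sketch (the threshold value is illustrative):

```python
def rtts_to_reach(ssthresh_mss, cwnd_mss=1):
    """RTTs of slow start needed before cwnd reaches ssthresh,
    given that cwnd doubles every RTT (one extra MSS per ACK)."""
    rtts = 0
    while cwnd_mss < ssthresh_mss:
        cwnd_mss *= 2
        rtts += 1
    return rtts

print(rtts_to_reach(64))   # 1 -> 2 -> 4 -> ... -> 64 in 6 RTTs
```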

loss indicated by timeout:
• cwnd set to 1 MSS
• window then grows exponentially (as in slow start) to threshold, then grows linearly

loss indicated by 3 duplicate ACKs (TCP Reno):
• dup ACKs indicate network capable of delivering some segments
• cwnd is cut in half, window then grows linearly

TCP Tahoe: always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)

TCP Congestion Control: Detecting and Reacting to Loss

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before the loss event

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)

[FSM diagram: TCP Reno congestion control, with states slow start, congestion avoidance, and fast recovery]

slow start:
• entry (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): → congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, → fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, → slow start
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, → fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0, → congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, → slow start

TCP Congestion Control
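The FSM above can be rendered as an event handler. A simplified sketch of the Reno transitions, with cwnd counted in integer MSS units; the congestion-avoidance "new ACK" event is treated as a per-RTT aggregate, and entry actions are condensed:

```python
MSS = 1  # count cwnd and ssthresh in MSS units for simplicity

def reno_step(state, cwnd, ssthresh, event):
    """One transition of the TCP Reno congestion-control FSM (sketch).
    Returns the new (state, cwnd, ssthresh)."""
    if event == "timeout":                        # any state -> slow start
        return "slow_start", MSS, max(cwnd // 2, MSS)
    if event == "dup_ack_x3" and state != "fast_recovery":
        ssthresh = max(cwnd // 2, MSS)            # halve, then inflate by 3
        return "fast_recovery", ssthresh + 3 * MSS, ssthresh
    if event == "dup_ack" and state == "fast_recovery":
        return state, cwnd + MSS, ssthresh        # window inflation
    if event == "new_ack":
        if state == "fast_recovery":              # deflate back to ssthresh
            return "congestion_avoidance", ssthresh, ssthresh
        if state == "slow_start":
            cwnd += MSS                           # +1 MSS per ACK: doubles per RTT
            new_state = "congestion_avoidance" if cwnd >= ssthresh else "slow_start"
            return new_state, cwnd, ssthresh
        # congestion avoidance: cwnd += MSS*(MSS/cwnd) per ACK, ~ +1 MSS per RTT;
        # the event here stands for that per-RTT aggregate
        return "congestion_avoidance", cwnd + MSS, ssthresh
    return state, cwnd, ssthresh                  # e.g. an isolated dup ACK
```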

• avg TCP throughput as a function of window size and RTT
  – ignore slow start, assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT:

avg TCP throughput = (3/4) · W / RTT  bytes/sec

[figure: window oscillating between W/2 and W]

TCP Throughput
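Plugging numbers into the average-throughput expression (the example values are illustrative):

```python
def avg_tcp_throughput(W, rtt):
    """Mean Reno throughput ignoring slow start: the window oscillates
    between W/2 and W, so the average window is 3/4 W per RTT."""
    return 0.75 * W / rtt

# e.g., W = 80 KB at loss, RTT = 100 ms  ->  600,000 bytes/sec
print(avg_tcp_throughput(80_000, 0.100))
```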

TCP over "long, fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = 1.22 · MSS / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• new versions of TCP for high-speed
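Both numbers on this slide fall out of the formula. A quick check with the slide's values:

```python
MSS_BITS = 1500 * 8      # 1500-byte segments
RTT = 0.100              # 100 ms
TARGET = 10e9            # 10 Gbps

# window needed to fill the pipe: bandwidth-delay product / segment size
W = TARGET * RTT / MSS_BITS                   # ~ 83,333 in-flight segments

# Mathis et al.: throughput ~ 1.22 * MSS / (RTT * sqrt(L)); solve for L
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2   # ~ 2e-10

print(round(W), L)
```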

Goals for Today
• Transport Layer
  – Abstraction, services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[figure: a client connected through a switch to storage servers 1–4; a data block is striped across the servers, each holding one Server Request Unit (SRU); in a synchronized read the client requests all SRUs in parallel, and only after receiving responses 1, 2, 3, and 4 does it send the next batch of requests]

Link idle time due to timeouts

[figure: the same synchronized read, but server 4's SRU is lost; the client's link is idle until server 4 experiences a timeout and retransmits]

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[figure: goodput collapses as the number of servers increases]

TCP data-driven loss recovery

[figure: sender transmits segments 1–5; segment 2 is lost; segments 3, 4, and 5 each trigger another Ack 1, giving 3 duplicate ACKs for 1 (packet 2 is probably lost); the sender retransmits packet 2 immediately and then receives Ack 5]

In SANs, data-driven recovery completes in usecs after a loss.

TCP timeout-driven loss recovery

[figure: sender transmits segments 1–5 but the ACKs needed for fast retransmit never arrive; the sender must wait a full Retransmission Timeout (RTO) before retransmitting, after which Ack 1 finally comes back]

• Timeouts are expensive (msecs to recover after loss)

TCP loss recovery comparison

[figure: side by side, the two recoveries above: data-driven recovery retransmits segment 2 immediately on 3 duplicate ACKs; timeout-driven recovery waits a full RTO before retransmitting]

• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (us) in SANs

TCP Throughput Collapse: Summary

• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTO_min (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

principles behind transport layer services:
• multiplexing, demultiplexing
• reliable data transfer
• flow control
• congestion control

instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Perspective

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule




buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application processbull receiver ldquoadvertisesrdquo free

buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via

socket options (typical default is 4096 bytes)

ndash many operating systems autoadjust RcvBuffer

bull sender limits amount of unacked (ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

receiver-side buffering

TCP Reliable TransportFlow Control

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over "long, fat pipes"
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Principles of Congestion Control
congestion:
• informally: "too many sources sending too much data too fast for the network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)

Principles of Congestion Control
two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP Congestion Control: TCP Fairness
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K

(figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R)

TCP Congestion Control: TCP Fairness
Why is TCP fair? AIMD: additive increase, multiplicative decrease
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
• additive increase: increase cwnd by 1 MSS every RTT until loss detected
• multiplicative decrease: cut cwnd in half after loss

(figure: AIMD "sawtooth" behavior: TCP sender congestion window size vs. time; the window additively increases, probing for bandwidth, until loss occurs, then is cut in half)
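The AIMD rule above can be sketched in a few lines. This is an illustrative model, not TCP itself; the link capacity (in MSS) at which loss occurs is an assumed parameter:

```python
# AIMD sketch: additive increase of 1 MSS per RTT, multiplicative
# decrease (halving) on loss. "capacity" is an assumed link limit in MSS.
def aimd(rtts, capacity=8, cwnd=1):
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd >= capacity:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease on loss
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
    return trace

print(aimd(20))  # ramps 1..8, halves to 4, ramps again: the sawtooth
```

Plotting the returned trace reproduces the sawtooth on the slide.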

TCP Congestion Control
• sender limits transmission: LastByteSent – LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion
• TCP sending rate, roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

(figure: sender sequence number space: last byte ACKed; bytes sent but not-yet-ACKed ("in-flight"); last byte sent; the in-flight span is bounded by cwnd)

TCP Congestion Control: TCP Fairness
Why is TCP fair? two competing sessions:
• additive increase gives slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

(figure: Connection 1 throughput vs. Connection 2 throughput, each axis up to R; repeated rounds of congestion avoidance (additive increase) and loss (window decreased by factor of 2) move the operating point toward the equal-bandwidth-share line)
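A toy two-flow model makes the convergence concrete. This is an assumption-laden sketch: both flows see a synchronized loss whenever their combined window exceeds the bottleneck capacity R:

```python
# Two AIMD flows sharing a bottleneck of capacity R (in MSS). When the
# combined window exceeds R, both flows halve (synchronized loss);
# otherwise each adds 1 MSS per RTT. Additive steps preserve the gap
# between the flows; each halving cuts the gap in half, so they converge.
def two_flows(rtts, R=100, w1=5, w2=75):
    for _ in range(rtts):
        if w1 + w2 > R:
            w1, w2 = w1 / 2, w2 / 2   # multiplicative decrease
        else:
            w1, w2 = w1 + 1, w2 + 1   # additive increase
    return w1, w2

w1, w2 = two_flows(500)
print(abs(w1 - w2) < 1)  # True: despite starting at 5 vs 75, shares equalize
```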

TCP Congestion Control: TCP Fairness
Fairness and UDP:
• multimedia apps often do not use TCP: they do not want their rate throttled by congestion control
• instead use UDP: send audio/video at constant rate, tolerate packet loss

Fairness and parallel TCP connections:
• an application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
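The R/10 vs. R/2 arithmetic follows from an idealized equal-per-connection share. That equal split is an assumption for illustration; real TCP shares also depend on RTTs and loss patterns:

```python
# Idealized model: a bottleneck of rate R is shared equally by all
# connections, so an app opening k parallel connections alongside 9
# existing ones receives the fraction k / (9 + k) of R.
def app_share(k, existing=9, R=1.0):
    return k / (existing + k) * R

print(app_share(1))   # 0.1  -> R/10, as on the slide
print(app_share(11))  # 0.55 -> roughly R/2, as on the slide
```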

TCP Congestion Control: Slow Start
when connection begins, increase rate exponentially until first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT
• done by incrementing cwnd for every ACK received
summary: initial rate is slow, but ramps up exponentially fast

(figure: Host A sends one segment to Host B, then two segments, then four: one round of doubling per RTT)
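The doubling can be sketched directly. This assumes one ACK per segment (no delayed ACKs) and no losses:

```python
# Slow start: cwnd grows by 1 MSS per ACK received; with one ACK per
# in-flight segment, cwnd doubles every RTT.
def slow_start(rtts, cwnd=1):
    per_rtt = []
    for _ in range(rtts):
        per_rtt.append(cwnd)
        cwnd += cwnd   # cwnd ACKs arrive this RTT, each adding 1 MSS
    return per_rtt

print(slow_start(5))  # [1, 2, 4, 8, 16]
```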

TCP Congestion Control: Detecting and Reacting to Loss
• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs (TCP Reno):
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP Congestion Control: Switching from Slow Start to Congestion Avoidance (CA)
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before the loss event

TCP Congestion Control: FSM summary

slow start (initial state: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0):
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• when cwnd > ssthresh: enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
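The transitions above can be collected into a small state machine. This is a simplified sketch of the Reno FSM on the slide, not a full TCP implementation: cwnd and ssthresh are in MSS units, and no segments are actually transmitted:

```python
# Simplified TCP Reno congestion-control FSM (sketch of the slide's
# diagram). Events: "new_ack", "dup_ack", "timeout".
class RenoFSM:
    def __init__(self):
        self.state = "slow_start"
        self.cwnd, self.ssthresh, self.dup = 1.0, 64.0, 0

    def on(self, event):
        if event == "timeout":                    # from any state
            self.ssthresh = self.cwnd / 2
            self.cwnd, self.dup = 1.0, 0          # retransmit missing segment
            self.state = "slow_start"
        elif event == "dup_ack":
            if self.state == "fast_recovery":
                self.cwnd += 1                    # inflate for each dup ACK
            else:
                self.dup += 1
                if self.dup == 3:                 # fast retransmit
                    self.ssthresh = self.cwnd / 2
                    self.cwnd = self.ssthresh + 3
                    self.state = "fast_recovery"
        elif event == "new_ack":
            if self.state == "fast_recovery":     # deflate, resume CA
                self.cwnd = self.ssthresh
                self.state = "congestion_avoidance"
            elif self.state == "slow_start":
                self.cwnd += 1                    # +1 MSS per ACK
                if self.cwnd > self.ssthresh:
                    self.state = "congestion_avoidance"
            else:                                 # congestion avoidance
                self.cwnd += 1 / self.cwnd        # roughly +1 MSS per RTT
            self.dup = 0

fsm = RenoFSM()
for _ in range(5):
    fsm.on("new_ack")          # slow start: cwnd 1 -> 6
for _ in range(3):
    fsm.on("dup_ack")          # third dup ACK enters fast recovery
print(fsm.state, fsm.cwnd)     # fast_recovery 6.0
```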

TCP Throughput
• avg TCP throughput as a function of window size, RTT
  – ignore slow start, assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT

  avg TCP throughput = (3/4 · W) / RTT bytes/sec

(figure: sawtooth of window size oscillating between W/2 and W)
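A quick numeric instance of the rule; W and RTT here are assumed example values, not from the slide:

```python
# Average-throughput rule: the window oscillates between W/2 and W, so
# the average window is 3/4 W and average throughput is (3/4 W) / RTT.
W = 100_000                    # bytes at which loss occurs (assumed)
RTT = 0.1                      # seconds (assumed)
avg_thruput = 0.75 * W / RTT
print(round(avg_thruput))      # 750000 bytes/sec
```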

TCP over "long, fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = (1.22 · MSS) / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate
• hence new versions of TCP for high-speed networks
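The slide's numbers can be checked directly from the formula, using the example's 1500-byte segments, 100 ms RTT, and 10 Gbps target:

```python
from math import sqrt

MSS = 1500                 # bytes per segment (from the example)
RTT = 0.1                  # 100 ms
target = 10e9 / 8          # 10 Gbps expressed in bytes/sec

W = target * RTT / MSS                    # in-flight segments needed
L = (1.22 * MSS / (RTT * target)) ** 2    # solve throughput = 1.22*MSS/(RTT*sqrt(L)) for L

print(round(W))                                         # 83333 segments
print(abs(1.22 * MSS / (RTT * sqrt(L)) - target) < 1)   # True: L reproduces the target rate
```

L comes out around 2.1·10⁻¹⁰, matching the slide's "L = 2·10⁻¹⁰".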

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in the transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

(figure: a client connects through a switch to storage servers 1–4; a data block is striped across the servers in Server Request Units (SRUs); in a synchronized read the client requests SRUs 1–4 in parallel, and sends the next batch of requests only after receiving all of the current batch)

Link idle time due to timeouts

(figure: the same synchronized read of SRUs 1–4, but server 4's response is lost; the client's link is idle until server 4 experiences a timeout and retransmits)

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• cause of throughput collapse: TCP timeouts

(figure: throughput vs. number of servers, showing the collapse as servers are added)

TCP data-driven loss recovery

(figure: sender transmits segments 1–5; segment 2 is lost; segments 3, 4, and 5 each trigger a duplicate Ack 1; after 3 duplicate ACKs for 1, packet 2 is probably lost, so the sender retransmits packet 2 immediately, and the receiver responds with Ack 5)

In SANs: recovery in microseconds after loss.

TCP timeout-driven loss recovery

(figure: sender transmits segments 1–5; the loss pattern yields too few duplicate ACKs to trigger fast retransmit, so the sender sits idle for a full Retransmission Timeout (RTO) before retransmitting)

• Timeouts are expensive (msecs to recover after loss)

TCP loss recovery comparison

(figure, side by side: data-driven recovery, where 3 duplicate ACKs trigger an immediate retransmit of segment 2 followed by Ack 5; and timeout-driven recovery, where the sender waits out a full Retransmission Timeout (RTO) before retransmitting)

• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (µs) in SANs
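An order-of-magnitude comparison makes the gap concrete. Both numbers below are assumptions chosen for illustration (a 100 µs SAN round trip and a 200 ms minimum RTO), not measurements from the paper:

```python
# Data-driven recovery costs on the order of one RTT (time to receive
# 3 duplicate ACKs and retransmit); timeout-driven recovery waits out
# the RTO, which is commonly floored at roughly 200 ms.
rtt_san = 100e-6               # 100 us SAN round-trip time (assumed)
rto_min = 200e-3               # 200 ms minimum RTO (assumed)

print(rto_min / rtt_san)       # ~2000x slower to recover via timeout
```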

TCP Throughput Collapse: Summary
• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
  – limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications

Perspective
• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
  – But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 41: Transport Layer and Data Center TCP

congestionbull informally ldquotoo many sources sending too

much data too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

Principles of Congestion Control

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 42: Transport Layer and Data Center TCP

two broad approaches towards congestion control

end-end congestion control

no explicit feedback from network

congestion inferred from end-system observed loss delay

approach taken by TCP

network-assisted congestion control

routers provide feedback to end systemssingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

explicit rate for sender to send at

Principles of Congestion Control

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

TCP connection 1

bottleneckroutercapacity RTCP connection 2

TCP Congestion ControlTCP Fairness

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 43: Transport Layer and Data Center TCP

TCP Congestion Control: TCP Fairness

Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K.

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

TCP Congestion Control: TCP Fairness — Why is TCP Fair?

AIMD: additive increase, multiplicative decrease. Approach: the sender increases its transmission rate (window size), probing for usable bandwidth, until loss occurs:
• additive increase: increase cwnd by 1 MSS every RTT until loss is detected
• multiplicative decrease: cut cwnd in half after loss

[Figure: TCP sender congestion window size (cwnd) vs. time — the AIMD "sawtooth" behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
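The AIMD sawtooth can be sketched in a few lines (an illustrative model with an assumed bottleneck capacity, not a packet-level simulator):

```python
# Minimal AIMD sketch: cwnd grows by 1 MSS per RTT and is halved
# whenever it exceeds the (assumed) bottleneck capacity.
def aimd(capacity_mss=20, rtts=50):
    cwnd, trace = 1.0, []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd > capacity_mss:   # loss detected this RTT
            cwnd = cwnd / 2       # multiplicative decrease
        else:
            cwnd += 1             # additive increase: +1 MSS per RTT
    return trace

trace = aimd()
print(max(trace))  # peaks just above capacity, then halves: the sawtooth
```

Plotting `trace` reproduces the sawtooth shown on the slide: linear climbs separated by halvings.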

TCP Congestion Control

• Sender limits transmission: LastByteSent − LastByteAcked ≤ cwnd
• cwnd is a dynamic function of perceived network congestion
• TCP sending rate: roughly, send cwnd bytes, wait one RTT for ACKs, then send more bytes:

    rate ≈ cwnd / RTT  bytes/sec

[Figure: sender sequence number space — last byte ACKed; bytes sent but not yet ACKed ("in-flight", at most cwnd); last byte sent]
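A back-of-envelope check of the rate ≈ cwnd/RTT relationship, with assumed illustrative values for cwnd and RTT (not from the slides):

```python
# Sending-rate estimate: rate ≈ cwnd / RTT.
# Assumed numbers: cwnd = 64 KB, RTT = 100 ms.
cwnd_bytes = 64 * 1024
rtt_sec = 0.100
rate_bps = cwnd_bytes / rtt_sec * 8   # convert bytes/sec to bits/sec
print(round(rate_bps / 1e6, 2), "Mbps")  # ~5.24 Mbps for these numbers
```

The same window yields a much higher rate on a shorter RTT, which is why window size and RTT together determine throughput.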

TCP Congestion Control: TCP Fairness — Why is TCP Fair?

Two competing sessions:
• additive increase gives a slope of 1 as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 1 throughput vs. Connection 2 throughput, both axes up to R — repeated congestion-avoidance additive increase and loss-triggered halving (decrease window by factor of 2) moves the operating point toward the "equal bandwidth share" line]
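The convergence toward an equal share can be sketched with a toy two-flow AIMD model (the capacity, starting rates, and synchronized-loss assumption are illustrative simplifications):

```python
# Sketch: two AIMD flows sharing a bottleneck of capacity r.
# Additive increase preserves their difference; each halving cuts it
# in half, so the flows converge toward equal shares.
def two_flow_aimd(r=100.0, x1=10.0, x2=80.0, rounds=200):
    for _ in range(rounds):
        if x1 + x2 > r:           # congestion: both halve
            x1, x2 = x1 / 2, x2 / 2
        else:                     # both gain equally
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2

x1, x2 = two_flow_aimd()
print(round(abs(x1 - x2), 2))  # difference shrinks toward zero
```

This is exactly the geometric argument in the figure: increase moves the point along slope 1, decrease pulls it toward the origin along its current ray.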

TCP Congestion Control: TCP Fairness

Fairness and UDP: multimedia apps often do not use TCP — they do not want their rate throttled by congestion control. Instead they use UDP: send audio/video at a constant rate, tolerate packet loss.

Fairness and parallel TCP connections: an application can open multiple parallel connections between two hosts; web browsers do this. E.g., for a link of rate R with 9 existing connections:
• a new app asking for 1 TCP connection gets rate R/10
• a new app asking for 11 TCP connections gets roughly R/2
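The R/10 vs. R/2 arithmetic as a tiny helper (hypothetical function name; assumes perfectly fair per-connection sharing):

```python
# Per-connection fair share: with N total connections on a link of
# rate r, each gets r/N; an app's share is its connection count
# times that per-connection share.
def app_share(r, existing, app_conns):
    total = existing + app_conns
    return app_conns * r / total

R = 1.0
print(app_share(R, 9, 1))    # 1 of 10 connections -> 0.1 (R/10)
print(app_share(R, 9, 11))   # 11 of 20 connections -> 0.55 (~R/2)
```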

TCP Congestion Control: Slow Start

When a connection begins, increase the rate exponentially until the first loss event:
• initially cwnd = 1 MSS
• double cwnd every RTT — done by incrementing cwnd by 1 MSS for every ACK received

Summary: the initial rate is slow, but it ramps up exponentially fast.

[Figure: Host A sends one segment, then two segments, then four segments to Host B, one RTT apart]
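The per-ACK increment that produces per-RTT doubling can be sketched as follows (idealized: one ACK per segment, no loss or delayed ACKs):

```python
# Slow-start sketch: cwnd += 1 MSS per ACK. Each RTT delivers one ACK
# per outstanding segment, so cwnd doubles every RTT.
def slow_start(rtts, mss=1):
    cwnd = 1 * mss
    for _ in range(rtts):
        acks = cwnd // mss        # one ACK per segment sent this RTT
        cwnd += acks * mss        # +1 MSS per ACK => doubling
    return cwnd

print(slow_start(3))  # 1 -> 2 -> 4 -> 8 segments
```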

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

TCP Congestion Control

[Figure: TCP Reno congestion-control FSM — states and transitions]

Initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0

Slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd ≥ ssthresh (Λ): → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery

Congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; → fast recovery

Fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; → slow start
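The state machine above can be condensed into a small sketch (illustrative only; MSS = 1 unit, and the class and method names are invented for this example):

```python
# Compact sketch of the Reno FSM: slow start / congestion avoidance /
# fast recovery, driven by new-ACK, duplicate-ACK, and timeout events.
class Reno:
    def __init__(self):
        self.state, self.cwnd, self.ssthresh, self.dup = "slow_start", 1, 64, 0

    def on_new_ack(self):
        if self.state == "fast_recovery":        # recovery ends
            self.cwnd, self.state = self.ssthresh, "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1                       # exponential growth
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1 / self.cwnd           # ~ +1 MSS per RTT
        self.dup = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1                       # inflate during recovery
            return
        self.dup += 1
        if self.dup == 3:                        # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd, self.dup, self.state = 1, 0, "slow_start"

r = Reno()
for _ in range(10):
    r.on_new_ack()
print(r.state, r.cwnd)  # still in slow start: cwnd well below ssthresh
```

Feeding it three duplicate ACKs then a new ACK walks it through fast recovery and back to congestion avoidance, mirroring the diagram's transitions.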

TCP Throughput

• Avg TCP throughput as a function of window size and RTT
  – ignore slow start; assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT:

    avg TCP throughput = (3/4) · W / RTT  bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W]
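Plugging assumed numbers into the (3/4)·W/RTT estimate (W and RTT here are illustrative, not from the slides):

```python
# Average-throughput estimate: (3/4) * W / RTT bytes/sec.
# Assumed numbers: W = 120,000 bytes at loss, RTT = 100 ms.
def avg_tcp_throughput(w_bytes, rtt_sec):
    return 0.75 * w_bytes / rtt_sec     # bytes/sec

bps = avg_tcp_throughput(120_000, 0.100) * 8   # bits/sec
print(round(bps / 1e6, 1), "Mbps")             # 7.2 Mbps for these numbers
```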

TCP over "long, fat pipes"

• Example: 1500-byte segments, 100 ms RTT; want 10 Gbps throughput
• Requires W = 83,333 in-flight segments
• Throughput in terms of segment loss probability L [Mathis 1997]:

    TCP throughput = 1.22 · MSS / (RTT · √L)

• To achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰ — a very small loss rate!
• → new versions of TCP for high-speed networks
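Inverting the Mathis estimate reproduces the required loss rate from the slide's example numbers:

```python
# Mathis estimate: throughput = 1.22 * MSS / (RTT * sqrt(L)).
# Solve for L at 10 Gbps with MSS = 1500 bytes and RTT = 100 ms.
from math import sqrt

mss_bits = 1500 * 8          # 1500-byte segments
rtt = 0.100                  # 100 ms
target_bps = 10e9            # 10 Gbps

loss = (1.22 * mss_bits / (rtt * target_bps)) ** 2
throughput = 1.22 * mss_bits / (rtt * sqrt(loss))  # round-trip check
print(f"L = {loss:.1e}")     # ~2e-10, matching the slide
```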

Goals for Today

• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems," A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in the transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a Client connected through a Switch to Storage Servers 1–4. The client issues a synchronized read: a data block is striped across the servers, and each server returns its Server Request Unit (SRU). Once all four SRUs arrive, the client sends the next batch of requests.]

Link idle time due to timeouts

[Figure: the same Client/Switch/Servers setup during a synchronized read. Server 4's SRU is lost, so the link is idle until that server experiences a timeout.]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Figure: goodput collapses as the number of servers increases]

TCP data-driven loss recovery

[Figure: the sender transmits Seq 1–5; packet 2 is lost. The receiver ACKs 1 for each of packets 3, 4, and 5 — three duplicate ACKs for 1 (packet 2 is probably lost) — so the sender retransmits packet 2 immediately, and the receiver then ACKs 5.]

In SANs: recovery in µsecs after loss.

TCP timeout-driven loss recovery

[Figure: the sender transmits Seq 1–5, but losses leave it with too few duplicate ACKs to trigger fast retransmit; it must wait for the full Retransmission Timeout (RTO) before retransmitting.]

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

[Figure: side-by-side sender/receiver timelines. Left: data-driven recovery — three duplicate ACKs trigger an immediate retransmit of packet 2, followed by Ack 5. Right: timeout-driven recovery — the sender stalls for the full Retransmission Timeout (RTO) before retransmitting.]

• Timeout-driven recovery is slow (ms)
• Data-driven recovery is super fast (µs) in SANs
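Rough orders of magnitude for the two recovery paths (assumed values: a ~100 µs SAN round-trip and the common 200 ms minimum RTO — illustrative, not measurements from the paper):

```python
# Data-driven recovery costs roughly one RTT; timeout-driven recovery
# costs at least RTO_min. In a SAN the gap is three-plus orders of
# magnitude, which is why timeouts dominate Incast collapse.
rtt_san = 100e-6      # ~100 us round-trip in a SAN (assumed)
rto_min = 200e-3      # common 200 ms minimum RTO (assumed default)
print(round(rto_min / rtt_san))  # timeout recovery ~2000x slower
```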

TCP Throughput Collapse: Summary

• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTO_min (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications


Page 44: Transport Layer and Data Center TCP

approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurs additive increase increase cwnd by 1

MSS every RTT until loss detected multiplicative decrease cut cwnd in half

after loss

cwnd

TC

P s

end

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion ControlTCP Fairness Why is TCP Fair AIMD additive increase multiplicative decrease

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 45: Transport Layer and Data Center TCP

sender limits transmission

cwnd is dynamic function of perceived network congestion

TCP sending rateroughly send cwnd

bytes wait RTT for ACKS then send more bytes

last byteACKed sent not-

yet ACKed(ldquoin-flightrdquo)

last byte sent

cwnd

LastByteSent-LastByteAcked

lt cwnd

sender sequence number space

rate ~~cwnd

RTTbytessec

TCP Congestion Control

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 46: Transport Layer and Data Center TCP

two competing sessions additive increase gives slope of 1 as throughout increases multiplicative decrease decreases throughput proportionally

R

R

equal bandwidth share

Connection 1 throughput

Con

nect

ion

2 th

roug

h pu t

congestion avoidance additive increaseloss decrease window by factor of 2

congestion avoidance additive increaseloss decrease window by factor of 2

TCP Congestion ControlTCP Fairness Why is TCP Fair

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

[FSM: TCP Reno congestion control. States: slow start, congestion avoidance, fast recovery.]

initialization: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
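The window arithmetic in the two loss reactions can be made concrete with a short sketch (function names and integer MSS units are mine, not from any real implementation):

```python
# Sketch of Reno's per-event window updates, in units of MSS.
def on_timeout(cwnd, ssthresh):
    # timeout: ssthresh = cwnd/2, cwnd back to 1 MSS, re-enter slow start
    return 1, max(cwnd // 2, 2)

def on_triple_dup_ack(cwnd, ssthresh):
    # 3 dup ACKs: ssthresh = cwnd/2, cwnd = ssthresh + 3, fast recovery
    ssthresh = max(cwnd // 2, 2)
    return ssthresh + 3, ssthresh

cwnd, ssthresh = 40, 64
cwnd, ssthresh = on_triple_dup_ack(cwnd, ssthresh)  # -> cwnd 23, ssthresh 20
cwnd, ssthresh = on_timeout(cwnd, ssthresh)         # -> cwnd 1,  ssthresh 11
print(cwnd, ssthresh)
```

Note the asymmetry the FSM encodes: three duplicate ACKs cost one halving, while a timeout resets the window entirely.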

TCP Throughput

• avg TCP throughput as a function of window size and RTT
  – ignore slow start; assume there is always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is ¾ W
  – avg throughput is ¾ W per RTT

avg TCP throughput = (3/4) · W / RTT bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W.]

TCP over "long, fat pipes"

• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability L [Mathis 1997]:

TCP throughput = 1.22 · MSS / (RTT · √L)

→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10⁻¹⁰, a very small loss rate!
• hence new versions of TCP for high-speed networks
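Both numbers on this slide follow directly from the formulas; a back-of-the-envelope check (variable names are mine, the 1.22 constant is from the Mathis paper):

```python
# Check the slide's numbers: 1500-byte segments, 100 ms RTT, 10 Gbps target.
MSS_BITS = 1500 * 8    # segment size in bits
RTT = 0.100            # seconds
TARGET = 10e9          # bits/sec

# Window needed: throughput = W * MSS / RTT  =>  W = TARGET * RTT / MSS
W = TARGET * RTT / MSS_BITS
print(round(W))        # 83333 in-flight segments

# Mathis et al.: throughput ~= 1.22 * MSS / (RTT * sqrt(L)).
# Solve for the loss rate L that sustains the target throughput.
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2
print(L)               # ~2e-10
```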

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse

What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in the transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[Figure: a client connected through a switch to storage servers 1–4. A synchronized read issues a request (R) to every server; each server returns its Server Request Unit (SRU), and together the SRUs form the data block. The client sends the next batch of requests only after all of servers 1, 2, 3, 4 have responded.]

Link idle time due to timeouts

[Figure: the same synchronized read, but server 4's SRU is lost at the switch. The client cannot issue the next batch of requests, so the link is idle until server 4 experiences a timeout and retransmits.]

TCP Throughput Collapse: Incast

• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[Graph: goodput vs. number of servers, collapsing as the server count grows.]
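One intuition for the collapse, as a toy model (entirely my own simplification, not the paper's analysis): a synchronized read cannot finish until every server's SRU arrives, so a timeout at any one server stalls the whole barrier, and the chance that some server times out grows with the server count.

```python
# Toy model of incast: each round moves one block; if any of the N
# servers hits a timeout, the round stalls for a full RTO.
RTO = 0.200    # seconds; a common conservative RTO_min (assumption)
XFER = 0.001   # seconds to move one SRU batch loss-free (assumption)

def goodput_fraction(n_servers, p_timeout_per_server):
    """Fraction of ideal goodput under the toy barrier model."""
    # probability that at least one of N servers times out this round
    p_any = 1 - (1 - p_timeout_per_server) ** n_servers
    expected_round = XFER + p_any * RTO
    return XFER / expected_round

for n in (1, 4, 16, 64):
    print(n, round(goodput_fraction(n, 0.02), 3))
```

Even with a 2% per-server timeout probability, the modeled goodput falls steeply as servers are added, which is the qualitative shape of the measured collapse.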

TCP data-driven loss recovery

[Figure: the sender transmits segments 1–5; segment 2 is lost. The receiver returns Ack 1 for each of segments 3, 4, 5. Three duplicate ACKs for 1 tell the sender that packet 2 is probably lost; it retransmits packet 2 immediately and receives Ack 5. In SANs, this recovery takes microseconds after a loss.]

TCP timeout-driven loss recovery

[Figure: the sender transmits segments 1–5, but the losses leave too few duplicate ACKs to trigger fast retransmit. The sender idles through a Retransmission Timeout (RTO), then retransmits segment 1 and receives Ack 1.]

• Timeouts are expensive (msecs to recover after a loss)

TCP: Loss recovery comparison

[Figure, side by side: left, data-driven recovery: segments 1–5 sent, 2 lost; three duplicate Ack 1s trigger an immediate retransmit of 2, answered by Ack 5. Right, timeout-driven recovery: the losses produce too few duplicate ACKs, so the sender waits out a Retransmission Timeout (RTO) before retransmitting.]

• Data-driven recovery is super fast (µs) in SANs
• Timeout-driven recovery is slow (ms)

TCP Throughput Collapse: Summary

• Synchronized reads and TCP timeouts cause TCP throughput collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTO_min (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP):
  – Limits the in-network buffer (queue length) via both in-network signaling and end-to-end TCP modifications
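DCTCP's sender-side reaction can be sketched as follows (based on the published algorithm from Alizadeh et al., SIGCOMM 2010: the switch ECN-marks packets past a queue threshold, the sender keeps an EWMA α of the marked fraction F, and cuts cwnd in proportion to α rather than halving unconditionally; the function name is mine, g = 1/16 is the paper's suggested gain):

```python
# Sketch of DCTCP's congestion-window update at the sender.
G = 1 / 16  # gain for the moving average of the marked fraction

def dctcp_update(cwnd, alpha, marked, acked):
    F = marked / acked               # fraction of ACKs carrying ECN marks
    alpha = (1 - G) * alpha + G * F  # running estimate of congestion extent
    cwnd = cwnd * (1 - alpha / 2)    # mild cut when marking is light
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
cwnd, alpha = dctcp_update(cwnd, alpha, marked=2, acked=16)
print(round(cwnd, 2), round(alpha, 4))  # 99.61 0.0078
```

With light marking the cut is a fraction of a percent, versus Reno's 50%; when every packet is marked (α → 1), the cut approaches Reno's halving.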

Perspective

• principles behind transport layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control
• instantiation and implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time

• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
• But review chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer Services/Protocols
  • Transport Layer Services/Protocols (2)
  • Transport Layer Services/Protocols (3)
  • Transport Layer Services/Protocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over "long, fat pipes"
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 47: Transport Layer and Data Center TCP

Fairness and UDPmultimedia apps

often do not use TCP do not want rate

throttled by congestion control

instead use UDP send audiovideo at

constant rate tolerate packet loss

Fairness parallel TCP connections

application can open multiple parallel connections between two hosts

web browsers do this eg link of rate R with 9

existing connections new app asks for 1 TCP gets rate R10 new app asks for 11 TCPs gets R2

TCP Congestion ControlTCP Fairness

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 48: Transport Layer and Data Center TCP

when connection begins increase rate exponentially until first loss event initially cwnd = 1 MSS double cwnd every RTT done by incrementing cwnd

for every ACK receivedsummary initial rate is

slow but ramps up exponentially fast

Host A

one segment

RT

T

Host B

time

two segments

four segments

TCP Congestion ControlSlow Start

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 49: Transport Layer and Data Center TCP

TCP Congestion Control

loss indicated by timeout cwnd set to 1 MSS window then grows exponentially (as in slow start) to threshold then grows linearly

loss indicated by 3 duplicate ACKs TCP RENO dup ACKs indicate network capable of delivering some segments cwnd is cut in half window then grows linearly

TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Detecting and Reacting to Loss

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

timeoutssthresh = cwnd2cwnd = 1 MSSdupACKcount = 0retransmit missing segment

Lcwnd gt ssthresh

congestionavoidance

cwnd = cwnd + MSS (MSScwnd)dupACKcount = 0transmit new segment(s) as allowed

new ACK

dupACKcount++

duplicate ACK

fastrecovery

cwnd = cwnd + MSStransmit new segment(s) as allowed

duplicate ACK

ssthresh= cwnd2cwnd = ssthresh + 3

retransmit missing segment

dupACKcount == 3

timeoutssthresh = cwnd2cwnd = 1 dupACKcount = 0retransmit missing segment

ssthresh= cwnd2cwnd = ssthresh + 3retransmit missing segment

dupACKcount == 3cwnd = ssthreshdupACKcount = 0

New ACK

slow start

timeoutssthresh = cwnd2 cwnd = 1 MSSdupACKcount = 0retransmit missing segment

cwnd = cwnd+MSSdupACKcount = 0transmit new segment(s) as allowed

new ACKdupACKcount++

duplicate ACK

Lcwnd = 1 MSSssthresh = 64 KBdupACKcount = 0

NewACK

NewACK

NewACK

TCP Congestion Control

bull avg TCP thruput as function of window size RTTndash ignore slow start assume always data to send

bull W window size (measured in bytes) where loss occursndash avg window size ( in-flight bytes) is frac34 Wndash avg thruput is 34W per RTT

W

W2

avg TCP thruput = 34WRTTbytessec

TCP Throughput

TCP over ldquolong fat pipesrdquo

bull example 1500 byte segments 100ms RTT want 10 Gbps throughput

bull requires W = 83333 in-flight segmentsbull throughput in terms of segment loss probability L

[Mathis 1997]

to achieve 10 Gbps throughput need a loss rate of L = 210-10 ndash a very small loss rate

bull new versions of TCP for high-speed

TCP throughput = 122 MSSRTT L

Goals for Todaybull Transport Layer

ndash Abstraction servicesndash MultiplexingDemultiplexingndash UDP Connectionless Transportndash TCP Reliable Transport

bull Abstraction Connection Management Reliable Transport Flow Control timeouts

ndash Congestion control

bull Data Center TCPndash Incast Problem

Slides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 50: Transport Layer and Data Center TCP

Q when should the exponential increase switch to linear

A when cwnd gets to 12 of its value before timeout

Implementation variable ssthresh on loss event ssthresh is

set to 12 of cwnd just before loss event

TCP Congestion ControlSwitching from Slow Start to Congestion Avoidance (CA)

TCP Congestion Control

[FSM figure; recoverable transitions:]
• init: cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; start in slow start
• slow start, new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed; when cwnd > ssthresh, enter congestion avoidance
• congestion avoidance, new ACK: cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK (in slow start or congestion avoidance): dupACKcount++; when dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment, enter fast recovery
• fast recovery, duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed; new ACK: cwnd = ssthresh, dupACKcount = 0, return to congestion avoidance
• timeout (any state): ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment, enter slow start
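The FSM above can be exercised with a small simulation. This is a sketch, not a real TCP stack: the per-event granularity, the initial ssthresh of 8 MSS, and the event names are simplifying assumptions.

```python
MSS = 1  # count cwnd in MSS units for readability

def tcp_cc(events, ssthresh=8):
    """Replay ACK/loss events through the FSM; return (event, state, cwnd) after each."""
    cwnd, state, dup, trace = 1, "slow_start", 0, []
    for ev in events:
        if ev == "new_ack":
            if state == "fast_recovery":
                cwnd, state = ssthresh, "congestion_avoidance"  # deflate window
            elif state == "slow_start":
                cwnd += MSS                     # exponential growth: +1 MSS per ACK
                if cwnd >= ssthresh:
                    state = "congestion_avoidance"
            else:
                cwnd += MSS * MSS / cwnd        # additive increase: ~+1 MSS per RTT
            dup = 0
        elif ev == "dup_ack":
            if state == "fast_recovery":
                cwnd += MSS                     # window inflation during recovery
            else:
                dup += 1
                if dup == 3:                    # fast retransmit threshold
                    ssthresh = max(cwnd // 2, 2)
                    cwnd, state = ssthresh + 3, "fast_recovery"
        elif ev == "timeout":
            ssthresh = max(int(cwnd) // 2, 2)
            cwnd, state, dup = 1, "slow_start", 0
        trace.append((ev, state, round(cwnd, 2)))
    return trace

for step in tcp_cc(["new_ack"] * 4 + ["dup_ack"] * 3 + ["new_ack", "timeout"]):
    print(step)
```

The trace shows the multiplicative decrease happening twice: a loss detected by 3 duplicate ACKs halves the window (via ssthresh), while a timeout resets cwnd all the way to 1 MSS.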

TCP Throughput
• avg TCP throughput as function of window size, RTT
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg window size (# in-flight bytes) is 3/4 W
  – avg throughput is 3/4 W per RTT

[sawtooth figure: window oscillates between W/2 and W]

avg TCP throughput = (3/4) W / RTT bytes/sec
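The rule of thumb above can be checked numerically; the window size and RTT below are illustrative values, not from the slides.

```python
def avg_tcp_throughput(W_bytes, rtt_s):
    """Average TCP throughput in bits/sec, given the loss window W:
    the sawtooth averages 3/4 W in flight per RTT."""
    return 0.75 * W_bytes * 8 / rtt_s

# e.g. a 100 KB loss window over a 100 ms RTT path
bps = avg_tcp_throughput(100_000, 0.1)
print(f"{bps / 1e6:.1f} Mbit/s")
```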

TCP over "long fat pipes"
• example: 1500-byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = 1.22 · MSS / (RTT · √L)

  → to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10 (a very small loss rate!)
• new versions of TCP for high-speed
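Both numbers on the slide can be reproduced: the in-flight window is one bandwidth-delay product's worth of segments, and inverting the Mathis formula gives the implied loss rate.

```python
from math import sqrt

MSS_BITS = 1500 * 8      # 1500-byte segments
RTT = 0.1                # 100 ms
TARGET = 10e9            # 10 Gbps

# window needed: one RTT's worth of data at the target rate, in segments
W = TARGET * RTT / MSS_BITS
print(f"W = {W:,.0f} segments")

# invert Mathis: throughput = 1.22*MSS/(RTT*sqrt(L))  =>  L = (1.22*MSS/(RTT*throughput))^2
L = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2
print(f"L = {L:.2e}")

# sanity check: plugging L back in recovers the 10 Gbps target
assert abs(1.22 * MSS_BITS / (RTT * sqrt(L)) - TARGET) < 1e-3
```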

Goals for Today
• Transport Layer
  – Abstraction / services
  – Multiplexing/Demultiplexing
  – UDP: Connectionless Transport
  – TCP: Reliable Transport
    • Abstraction, Connection Management, Reliable Transport, Flow Control, timeouts
  – Congestion control
• Data Center TCP
  – Incast Problem

Slides used judiciously from "Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems", A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan. Proc. of USENIX File and Storage Technologies (FAST), February 2008.

TCP Throughput Collapse
What happens when TCP is "too friendly"? E.g.:
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase # of servers involved in transfer
  – SRU size is fixed
• TCP used as the data transfer protocol

Cluster-based Storage Systems

[figure: a client connected through a switch to storage servers 1–4; the client issues a synchronized read, requesting one Server Request Unit (SRU) from each server; the servers' responses together form a data block, and the client sends the next batch of requests only after all SRUs arrive]

Link idle time due to timeouts

[figure: the same synchronized read through the switch; server 4's response is lost, so the client's link is idle until that server experiences a timeout and retransmits]
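The cost of that idle period can be seen with a toy goodput model. This is an illustrative assumption, not the FAST '08 paper's model: the SRU size, link rate, and minimum RTO below are made up, and the block completes only when the slowest server finishes.

```python
SRU = 256 * 1024          # bytes per server per block (assumed)
LINK = 1e9                # 1 Gbps client link (assumed)
RTO_MIN = 0.2             # typical TCP minimum retransmission timeout, seconds

def goodput_mbps(n_servers, timeout=False):
    """Goodput of one synchronized read, in Mbit/s."""
    block = n_servers * SRU
    t = block * 8 / LINK              # ideal transfer time for the whole block
    if timeout:                       # one straggler server hit an RTO,
        t += RTO_MIN                  # stalling the entire block
    return block * 8 / t / 1e6

print(f"no timeout : {goodput_mbps(8):7.1f} Mbit/s")
print(f"one timeout: {goodput_mbps(8, timeout=True):7.1f} Mbit/s")
```

Under these assumptions a single 200 ms timeout drags a transfer that would otherwise saturate the link down to a small fraction of its capacity, which is the shape of the collapse curve on the next slide.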

TCP Throughput Collapse: Incast
• [Nagle04] called this Incast
• Cause of throughput collapse: TCP timeouts

[graph: goodput collapses as the number of servers grows]

TCP data-driven loss recovery

[figure: sender transmits segments 1–5; segment 2 is lost; the receiver answers each later arrival with Ack 1, and 3 duplicate ACKs for 1 (packet 2 is probably lost) trigger an immediate retransmit of packet 2, after which the receiver sends Ack 5]

In SANs, recovery in usecs after loss.

TCP timeout-driven loss recovery

[figure: sender transmits segments 1–5; too few duplicate ACKs come back, so the sender sits idle for a full Retransmission Timeout (RTO) before resending segment 1]

• Timeouts are expensive (msecs to recover after loss)

TCP Loss recovery comparison

[figure: side-by-side timelines of the two cases above; data-driven: 3 duplicate ACKs trigger Retransmit 2 within about a round trip; timeout-driven: the sender idles for a full Retransmission Timeout (RTO)]

Timeout-driven recovery is slow (ms); data-driven recovery is super fast (us) in SANs.
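The gap between the two recovery paths is easy to quantify; the SAN round-trip time and the 200 ms minimum RTO below are illustrative assumptions.

```python
RTT = 100e-6        # ~100 us round trip inside a SAN (assumed)
RTO_MIN = 0.2       # common default minimum retransmission timeout

data_driven = RTT         # ~1 RTT until 3 dup ACKs trigger fast retransmit
timeout_driven = RTO_MIN  # sender sits idle until the RTO fires

print(f"data-driven : {data_driven * 1e6:8.0f} us")
print(f"timeout     : {timeout_driven * 1e6:8.0f} us")
print(f"ratio       : {timeout_driven / data_driven:8.0f}x")
```

With these numbers a timeout costs three to four orders of magnitude more than data-driven recovery, which is why incast, where too few duplicate ACKs arrive to trigger fast retransmit, is so damaging in data centers.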

TCP Throughput Collapse: Summary
• Synchronized Reads and TCP timeouts cause TCP Throughput Collapse
• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet Flow Control (limited applicability)
• DCTCP (Data Center TCP)
  – Limits in-network buffering (queue length) via both in-network signaling and end-to-end TCP modifications
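DCTCP's sender-side rule can be sketched as follows (per Alizadeh et al., SIGCOMM 2010): the switch ECN-marks packets once its queue exceeds a threshold, the sender tracks the fraction of marked ACKs in a moving average alpha, and cuts cwnd in proportion to alpha instead of always halving. The per-window event granularity and the numbers in the demo are illustrative assumptions.

```python
G = 1 / 16  # EWMA gain suggested in the DCTCP paper

def dctcp_update(cwnd, alpha, marked, total):
    """Process one window's worth of ACKs: update alpha, then cwnd."""
    F = marked / total                 # fraction of ECN-marked ACKs this window
    alpha = (1 - G) * alpha + G * F    # smoothed congestion estimate
    if marked:
        cwnd = cwnd * (1 - alpha / 2)  # mild congestion -> mild cut
    else:
        cwnd += 1                      # no marks -> grow as usual
    return cwnd, alpha

cwnd, alpha = 100.0, 0.0
for marked in (0, 0, 10, 10, 0):       # marks out of 100 ACKs per window
    cwnd, alpha = dctcp_update(cwnd, alpha, marked, 100)
    print(f"cwnd={cwnd:6.1f}  alpha={alpha:.3f}")
```

Note the contrast with standard TCP: a window with 10% of its ACKs marked costs this sender well under 1% of cwnd, whereas any loss costs a standard TCP sender half its window. That is how DCTCP keeps queues short without collapsing throughput.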

Perspective
• principles behind transport layer services:
  – multiplexing, demultiplexing
  – reliable data transfer
  – flow control
  – congestion control
• instantiation, implementation in the Internet: UDP, TCP

Next time: Network Layer
• leaving the network "edge" (application, transport layers)
• into the network "core"

Before Next time
• Project Proposal
  – due in one week
  – Meet with groups, TA, and professor
• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday
• No required reading and review due
  – But review chapter 4 from the book, Network Layer
  – We will also briefly discuss data center topologies
• Check website for updated schedule

Page 55: Transport Layer and Data Center TCP

TCP Throughput CollapseWhat happens when TCP is ldquotoo friendlyrdquoEgbull Test on an Ethernet-based storage cluster

bull Client performs synchronized reads

bull Increase of servers involved in transferndash SRU size is fixed

bull TCP used as the data transfer protocolSlides used judiciously from ldquoMeasurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systemsrdquo A Phanishayee E Krevat V Vasudevan D G Andersen G R Ganger G A Gibson and S Seshan Proc of USENIX File and Storage Technologies (FAST) February 2008

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP loss recovery comparison

[Figure: side-by-side timelines. Left, data-driven recovery: three duplicate ACKs trigger an immediate retransmit of segment 2, answered by Ack 5. Right, timeout-driven recovery: the sender idles for a full RTO before retransmitting.]

Timeout-driven recovery is slow (ms); data-driven recovery is super fast (µs) in SANs.

TCP Throughput Collapse: Summary

• Synchronized reads and TCP timeouts cause TCP throughput collapse

• Previously tried options:
  – Increase buffer size (costly)
  – Reduce RTOmin (unsafe)
  – Use Ethernet flow control (limited applicability)

• DCTCP (Data Center TCP)
  – Limits the in-network buffer occupancy (queue length) via both in-network signaling and end-to-end TCP modifications
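The last bullet can be made concrete. DCTCP has switches mark packets with ECN once the queue passes a threshold, and the sender cuts its window in proportion to the fraction of marked packets instead of halving on any loss signal. A hedged sketch of the end-host update (after Alizadeh et al., SIGCOMM 2010), with g = 1/16 as in the paper:

```python
# Sketch of DCTCP's end-host reaction to ECN marks (Alizadeh et al., 2010).
# The sender cuts cwnd in proportion to the marked fraction, so a lightly
# marked window is shaved slightly rather than halved.

G = 1 / 16  # EWMA gain for the marked fraction (DCTCP's 'g')

def dctcp_update(alpha, cwnd, marked, total):
    """Per-window update: new alpha estimate and proportionally cut cwnd."""
    frac = marked / total  # F: fraction of ACKs carrying ECN-Echo
    alpha = (1 - G) * alpha + G * frac       # alpha <- (1-g)*alpha + g*F
    cwnd = max(1.0, cwnd * (1 - alpha / 2))  # cwnd <- cwnd*(1 - alpha/2)
    return alpha, cwnd
```

This keeps switch queues short, leaving buffer headroom for incast bursts, without the drastic window collapses that leave senders idling on timeouts.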

Perspective

• Principles behind transport-layer services: multiplexing/demultiplexing, reliable data transfer, flow control, congestion control

• Instantiation and implementation in the Internet: UDP, TCP

Next time: Network Layer
• Leaving the network “edge” (application, transport layers)
• Into the network “core”

Before Next time

• Project Proposal
  – Due in one week
  – Meet with groups, TA, and professor

• Lab1
  – Single-threaded TCP proxy
  – Due in one week, next Friday

• No required reading and review due
• But review Chapter 4 from the book: Network Layer
  – We will also briefly discuss data center topologies

• Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 56: Transport Layer and Data Center TCP

Cluster-based Storage Systems

Client Switch

Storage Servers

RR

RR

1

2

Data Block

Server Request Unit(SRU)

3

4

Synchronized Read

Client now sendsnext batch of requests

1 2 3 4

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 57: Transport Layer and Data Center TCP

Link idle time due to timeouts

Client Switch

RR

RR

1

2

3

4

Synchronized Read

4

Link is idle until server experiences a timeout

1 2 3 4 Server Request Unit(SRU)

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 58: Transport Layer and Data Center TCP

TCP Throughput Collapse Incast

bull [Nagle04] called this Incastbull Cause of throughput collapse TCP timeouts

Collapse

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 59: Transport Layer and Data Center TCP

TCP data-driven loss recovery

Sender Receiver

123

4

5

Ack 1

Ack 1

Ack 1

Ack 1

3 duplicate ACKs for 1(packet 2 is probably lost)

2

Seq

Retransmit packet 2 immediately

In SANsrecovery in usecsafter loss

Ack 5

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 60: Transport Layer and Data Center TCP

TCP timeout-driven loss recovery

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

bull Timeouts are expensive(msecs to recover after loss)

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 61: Transport Layer and Data Center TCP

TCP Loss recovery comparison

Sender Receiver

12345

Ack 1

Ack 1Ack 1Ack 1

Retransmit 2

Seq

Ack 5

Sender Receiver

123

4

5

1

RetransmissionTimeout(RTO)

Ack 1

Seq

Timeout driven recovery is slow (ms)

Data-driven recovery issuper fast (us) in SANs

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 62: Transport Layer and Data Center TCP

TCP Throughput Collapse Summary

bull Synchronized Reads and TCP timeouts cause TCP Throughput Collapse

bull Previously tried optionsndash Increase buffer size (costly)ndash Reduce RTOmin (unsafe)ndash Use Ethernet Flow Control (limited applicability)

bull DCTCP (Data Center TCP)ndash Limited in-network buffer (queue length) via both in-

network signaling and end-to-end TCP modifications

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time
Page 63: Transport Layer and Data Center TCP

principles behind transport layer services multiplexing demultiplexing reliable data transfer flow control congestion control

instantiation implementation in the Internet UDP TCP

Next timebull Network Layerbull leaving the network

ldquoedgerdquo (application transport layers)

bull into the network ldquocorerdquo

Perspective

Before Next time

bull Project Proposalndash due in one weekndash Meet with groups TA and professor

bull Lab1ndash Single threaded TCP proxyndash Due in one week next Friday

bull No required reading and review duebull But review chapter 4 from the book Network Layer

ndash We will also briefly discuss data center topologiesndash

bull Check website for updated schedule

  • Transport Layer and Data Center TCP
  • Goals for Today
  • Transport Layer ServicesProtocols
  • Transport Layer ServicesProtocols (2)
  • Transport Layer ServicesProtocols (3)
  • Transport Layer ServicesProtocols (4)
  • Goals for Today (2)
  • Transport Layer
  • Goals for Today (3)
  • UDP Connectionless Transport
  • UDP Connectionless Transport (2)
  • Goals for Today (4)
  • Principles of Reliable Transport
  • Principles of Reliable Transport (2)
  • Principles of Reliable Transport (3)
  • Principles of Reliable Transport (4)
  • TCP Reliable Transport
  • TCP Reliable Transport (2)
  • TCP Reliable Transport (3)
  • TCP Reliable Transport (4)
  • TCP Reliable Transport (5)
  • TCP Reliable Transport (6)
  • TCP Reliable Transport (7)
  • TCP Reliable Transport (8)
  • TCP Reliable Transport (9)
  • TCP Reliable Transport (10)
  • TCP Reliable Transport (11)
  • TCP Reliable Transport (12)
  • TCP Reliable Transport (13)
  • TCP Reliable Transport (14)
  • Reliable Transport
  • TCP Reliable Transport (15)
  • TCP Reliable Transport (16)
  • TCP Reliable Transport (17)
  • TCP Reliable Transport (18)
  • TCP Reliable Transport (19)
  • TCP Reliable Transport (20)
  • TCP Reliable Transport (21)
  • TCP Reliable Transport (22)
  • Goals for Today (5)
  • Principles of Congestion Control
  • Principles of Congestion Control (2)
  • TCP Congestion Control
  • TCP Congestion Control (2)
  • TCP Congestion Control (3)
  • TCP Congestion Control (4)
  • TCP Congestion Control (5)
  • TCP Congestion Control (6)
  • TCP Congestion Control (7)
  • TCP Congestion Control (8)
  • TCP Congestion Control (9)
  • TCP Throughput
  • TCP over ldquolong fat pipesrdquo
  • Goals for Today (6)
  • TCP Throughput Collapse
  • Cluster-based Storage Systems
  • Link idle time due to timeouts
  • TCP Throughput Collapse Incast
  • TCP data-driven loss recovery
  • TCP timeout-driven loss recovery
  • TCP Loss recovery comparison
  • TCP Throughput Collapse Summary
  • Perspective
  • Before Next time