Department of Informatics, Networks and Distributed Systems (ND) group. TCP in painful detail (and some SCTP and QUIC as well). Michael Welzl, Net Group, University of Rome Tor Vergata, 22.09.2017


Page 1:

Department of Informatics, Networks and Distributed Systems (ND) group

TCP in painful detail (and some SCTP and QUIC as well)

Michael Welzl, Net Group, University of Rome Tor Vergata

22.09.2017

Page 2:

Who am I, to talk about TCP...

• Always worked on transport
  – Ugly → good playground

• Wrote the only general introductory textbook on congestion control

• Long IETF involvement
  – Chair of IRTF ICCRG since 2006
  – A few (tiny!) TCP contributions

Page 3:

Why focus on TCP? Let me explain.

→ The Internet's Transport Layer in a Nutshell
(sigh, sigh, sigh)

Page 4:

Yes, one slide is enough!!!!

• We have, and always had: TCP (reliable bytestream) and UDP (unreliably sending and receiving datagrams)
  – Because applications are tied to transports, not services
  – Trying to fix this with the NEAT project + IETF TAPS WG

• Lots of applications do what they want on top of UDP
  – LEDBAT, Adobe RTMFP, Google QUIC, ... RTP: is it a transport protocol?

• ...and some people say that HTTP(S) is the transport layer.

• SCTP is a TCP++, often regarded as too complex to use
  – Special use cases: telephony signaling, WebRTC data channel

• DCCP: congestion control for unreliable transfers – this is really dead

• UDP-Lite exists and works, but nobody cares about it

• MPTCP is TCP + multihoming (SCTP also has multihoming, but only with MPTCP it's OK to use multiple paths in parallel)

Page 5:

TCP, part 1:

1. Early history (up to ~2000s)
   ...so you know why things are the way they are.

2. Tech details from that time

Page 6:

The beginning (1970s, up to 1981)

• RFC 793, 1981, Jon Postel
  – Front page says: "DARPA INTERNET PROGRAM PROTOCOL SPECIFICATION" and "prepared for Defense Advanced Research Projects Agency"
  – Goal was obviously to make it super reliable: 85 pages, covering many corner cases

• Robustness is never a bad thing, so complexity has remained
  – Some things slightly obscure: e.g., half-closed connections: after saying "FIN", a host can still receive, with no time limit, until the other host says "FIN".

Page 7:

Some old things are best forgotten

• Push bit (PSH)
• Urgent pointer (URG)

• Generally, maybe don't read RFC 793...
  – draft-ietf-tcpm-rfc793bis-06:
    "This document obsoletes RFC 793, as well as 879, 6093, 6429, 6528, and 6691. It updates RFC 1122 (..) RFC 5961 (..)"

• Also consider: TCP spec roadmap (RFC 7414)
  – And implementations diverge...

Page 8:

Later 80's and 90's

• FreeBSD was the "reference" implementation
  – Matches the Stevens book; patches made over time were if-clauses in the code; later revamped
  – Originally, much code was written by people who also wrote the RFCs (Van Jacobson, Sally Floyd, Mark Allman, Matt Mathis, ..)
  – Many of them also wrote ns-2 code!
  – These people guided the design in the IETF

• Van Jacobson "saved the Internet" after congestion collapse, with code + a SIGCOMM 1988 paper about it: "Congestion Avoidance and Control"

• The Linux implementation is also common and well known in the IETF, with completely different code (segment-, not byte-based)
  – Focus of Google (more later)

Page 9:

Resulting IETF view

• TCP has been working for a long time, so let's be careful → making it more robust is okay.
  – Note: the WG is called "TCP Maintenance and Minor Extensions (TCPM)"

• ...and let's be careful about congestion control in particular.

• Our "early heroes" did a great job, so never break their rules
  → important congestion control principles (explained later, but keep them in mind):
  1. ACK clocking ("conservation of packets" principle)
  2. Timeout means that the network is empty

• But, note: the IETF also tries to stay meaningful
  – Voluntary, we have no "Internet police"
  – We should be happy that companies such as Google (in the case of TCP) keep coming to the IETF to tell us what they do

Page 10:

TCP, part 1:

1. Early history (up to ~2000s)

2. Tech details from that time
   The stuff of heroes!

Page 11:

What TCP does for you (roughly)

• UDP features: multiplexing + protection against corruption
  – ports, checksum
• connection handling
  – explicit establishment + teardown
• stream-based in-order delivery
  – segments are ordered according to sequence numbers
  – only consecutive bytes are delivered
• reliability
  – missing segments are detected (ACK is missing) and retransmitted
• flow control
  – receiver is protected against overload ("sliding window" mechanism)
• congestion control
  – network is protected against overload (window based)
  – protocol tries to fill available capacity
• full-duplex communication
  – e.g., an ACK can be a data segment at the same time (piggybacking)

Page 12:

TCP Header

• Flags indicate connection setup/teardown, ACK, ..
• If no data: the packet is just an ACK
• Window = advertised window from receiver (flow control)
  – Field size limits the sending rate in today's high-speed environments; solution: Window Scaling Option – both sides agree to left-shift the window value by N bits
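The left-shift in the last bullet can be sketched in a few lines (a hypothetical helper, not from the slides; the cap of 14 on the shift comes from RFC 7323):

```python
def effective_window(advertised: int, shift: int) -> int:
    """Effective receive window in bytes: the 16-bit header field,
    left-shifted by the scale factor negotiated at connection setup."""
    assert 0 <= advertised <= 0xFFFF   # the raw header field is 16 bit
    assert 0 <= shift <= 14            # RFC 7323 caps the shift at 14
    return advertised << shift

# Without scaling, the window tops out at 65535 bytes, far below the
# bandwidth*delay product of a fast long-distance path:
bdp = int(1e9 / 8 * 0.1)                     # 1 Gbit/s, 100 ms RTT -> 12.5 MB
print(effective_window(0xFFFF, 0) >= bdp)    # False: rwnd-limited
print(effective_window(0xFFFF, 8) >= bdp)    # True: 16776960 bytes
```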

Page 13:

The importance of Window Scaling

• 2007 measurements with Linux
  – TCP BIC vs. TCP Reno competition

• OSes gradually increase the factor
  – SIGCOMM 2017 QUIC paper: 4.6% rwnd-limited TCP connections

[Figures: local testbed; PlanetLab (Austria → Brazil)]

Page 14:

TCP Connection Management

[State diagram: heavy solid line = normal path for a client; heavy dashed line = normal path for a server; light lines = unusual events. Connection setup and teardown.]

Window (16 bit): This is the number of bytes (beginning with the one indicated in the acknowledgement field) that the receiver of a data stream is willing to accept. We will discuss this further in Section 3.1.3.

Checksum (16 bit): This is a standard Internet checksum (the 16-bit one's complement sum, the same algorithm that is used in IP (Postel 1981a)) covering the TCP header, the complete data and a so-called 'pseudo header'. The pseudo header is conceptually prefixed to the TCP header; it contains the source and destination addresses and the protocol number from the IP header, a zero field and a virtual 'TCP length' field that is computed on the fly. All these fields together clearly identify a TCP packet.

Urgent Pointer (16 bit): See the description of the URG flag above.

Options (variable): This optional field carries TCP options such as 'MSS' – this can be used by a receiver to inform the sender of the maximum allowed segment size at connection setup. We will discuss some other options in the following sections.

Data (variable): This optional field carries the actual application data. An ACK can carry data in addition to the ACK information (this process is called piggybacking), but in practice, this is a somewhat rare case that mainly occurs with interactive applications.

3.1.2 Connection handling

In order to reliably transfer a data stream from one end of the network to the other, TCP first establishes a logical connection using a common three-way handshake. This procedure is shown in Figure 3.2; first, Host 1 sends a segment with the SYN flag set in order to initiate a connection. Host 2 replies with a SYN, ACK (both the SYN and ACK flags in the TCP header are set), and finally, Host 1 acknowledges that the connection is now active with an ACK. These first segments of a connection also carry the initial sequence numbers, and they can be used for some additional tasks (e.g. determine the MSS via the option explained above).

[Figure 3.2: Connection setup (a) – SYN, SYN/ACK, ACK – and teardown (b) – FIN, ACK, FIN, ACK – procedure in TCP]

Page 15:

Connection establishment

• Sequence number synchronization ("SYN")
  – avoid mistaking packets that carry the same sequence number but don't belong to the intended connection

• A TCP SYN sets up state ("that was the number, I sent a SYN/ACK, now I wait for a response")
  – exploited by the SYN flood DoS attack
  – This is why data from a SYN must not be used before the handshake is over
  – Solution: put state in packets ("cookie")

• Can be implemented without changing the protocol, by encoding it in sequence numbers (SYN cookie)
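A rough sketch of the SYN-cookie idea: instead of storing per-connection state on SYN, the server encodes it into its initial sequence number and recovers it from the ACK. The encoding below (hash bits plus a 2-bit MSS index) is purely illustrative, not the real Linux layout; real cookies also fold in a coarse timestamp, omitted here to keep the sketch deterministic.

```python
import hashlib

SECRET = b"demo-secret"               # in practice: rotated server secret
MSS_TABLE = [536, 1220, 1440, 1460]   # encodable MSS values (2 bits)

def make_cookie(src, dst, sport, dport, mss_idx):
    """Encode connection state into a 32-bit initial sequence number."""
    h = hashlib.sha256(f"{src}|{dst}|{sport}|{dport}".encode() + SECRET)
    hash_bits = int.from_bytes(h.digest()[:4], "big") & 0xFFFFFFFC
    return hash_bits | mss_idx        # low 2 bits carry the MSS index

def check_cookie(cookie, src, dst, sport, dport):
    """Recompute the cookie from the ACK's addresses; if it matches,
    the handshake state (the MSS) is recovered without stored state."""
    mss_idx = cookie & 0x3
    if make_cookie(src, dst, sport, dport, mss_idx) == cookie:
        return MSS_TABLE[mss_idx]
    return None                       # forged or corrupted cookie

c = make_cookie("10.0.0.1", "10.0.0.2", 12345, 80, mss_idx=3)
print(check_cookie(c, "10.0.0.1", "10.0.0.2", 12345, 80))          # 1460
print(check_cookie(c ^ 0x10, "10.0.0.1", "10.0.0.2", 12345, 80))   # None
```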

Page 16:

Connection release

• No way to do it without timeouts...

Page 17:

Error control: Acknowledgement

• ACK ("positive" Acknowledgement) serves multiple purposes:
  – sender: throw away copy of data held for retransmit
  – time-out cancelled
  – msg-number can be re-used

• TCP ACKs are cumulative
  – ACK n acknowledges everything up to n-1

• ACKs should be delayed (except when sending duplicates – why? later!)
  – TCP ACKs are unreliable: dropping one does not cause much harm
  – Enough with 1 ACK every 2 segments, or at least 1 every 500 ms (often: 200 ms)

• TCP counts bytes; an ACK carries the "next expected byte" (# + 1)
  – Sender sends them as "segments", ideally of size SMSS
  – The Nagle algorithm delays sending to collect bytes & avoid sending tiny segments (can be disabled)

Following slides: segment numbers for simplicity (imagine 1-byte segments)

Page 18:

Path MTU Discovery (PMTUD)

• (IP) fragmentation = inefficient
  – But small packets have large header overhead

• Path MTU Discovery determines the largest packet that does not get fragmented
  – originally (RFC 1191, 1990): start large, reduce upon reception of an ICMP message → black hole problem if ICMP messages are filtered
  – now (RFC 4821, 2007): start small, increase as long as transport layer ACKs arrive → transport protocol dependent ("Packetization Layer Path MTU Discovery" (PLPMTUD))

• Network layer function with transport layer dependencies
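The "start small, increase while ACKs arrive" idea can be sketched as a search over probe sizes. This is illustrative only: `send_probe` is a hypothetical callback, and a binary search stands in for RFC 4821's actual probing schedule.

```python
def plpmtud(send_probe, low=1024, high=9000):
    """Find the largest probe size that still gets ACKed.
    send_probe(size) -> True if a probe of that size was ACKed."""
    while low < high:
        mid = (low + high + 1) // 2
        if send_probe(mid):
            low = mid           # probe ACKed: the path supports >= mid
        else:
            high = mid - 1      # probe lost: assume it was too big
    return low

# A path with a plain 1500-byte Ethernet MTU, modeled as a predicate:
print(plpmtud(lambda size: size <= 1500))   # -> 1500
```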

Page 19:

Error control: Timeout

• Go-Back-N behavior in response to timeout

• Retransmit Timeout (RTO) timer value difficult to determine:
  – too long → bad in case of msg-loss; too short → risk of false alarms
  – General consensus: too short is worse than too long; use a conservative estimate

• Calculation: measure RTT (Seg# ... ACK#), then: original suggestion in RFC 793: Exponentially Weighted Moving Average (EWMA)
  – SRTT = (1-a)*SRTT + a*RTT
  – RTO = min(UBOUND, max(LBOUND, b*SRTT))

• Depending on variation, the result may be too small or too large; thus, the final algorithm includes variation (approximated via the mean deviation)
  – SRTT = (1-a)*SRTT + a*RTT
  – d = (1-b)*d + b*|SRTT - RTT|
  – RTO = SRTT + 4*d

That's not how Linux does it
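The slide's final algorithm, runnable. The constants a = 1/8, b = 1/4 and the 1-second floor are the usual RFC 6298 values, not stated on the slide.

```python
def make_rto_estimator(first_rtt):
    """RTO from RTT samples: EWMA of the RTT plus 4x the mean deviation."""
    srtt = first_rtt
    rttvar = first_rtt / 2
    def update(rtt):
        nonlocal srtt, rttvar
        rttvar = 0.75 * rttvar + 0.25 * abs(srtt - rtt)  # mean deviation d
        srtt = 0.875 * srtt + 0.125 * rtt                # EWMA of RTT
        return max(1.0, srtt + 4 * rttvar)               # RTO = SRTT + 4d, floored
    return update

rto = make_rto_estimator(1.0)   # first sample: 1 s
print(rto(1.0))   # 2.5  -- steady RTT, variance term still decaying
print(rto(3.0))   # 4.375 -- a delay spike inflates the RTO via the variance
```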

Page 20:

RTO calculation

• Problem: retransmission ambiguity
  – Segment #1 sent, no ACK received → segment #1 retransmitted
  – Incoming ACK #2: cannot distinguish whether the original or the retransmitted segment #1 was ACKed
  – Thus, cannot reliably calculate the RTO!

• Solution 1 [Karn/Partridge]: ignore RTT values from retransmits
  – Problem: RTT calculation is especially important when loss occurs; the sampling theorem suggests that RTT samples should be taken more often

• Solution 2: Timestamps option
  – Sender writes the current time into the packet header (option)
  – Receiver reflects the value
  – At the sender, when the ACK arrives, RTT = (current time) - (value carried in option)
  – Problems: additional header space; historical: facilitates NAT detection

• Note: because of how the RTO is calculated, not much gain from sampling more than once per RTT

Page 21:

Window management

• Receiver "grants" credit (receiver window, rwnd)
  – sender restricts sent data with the window

• Receiver buffer not specified
  – i.e. the receiver may buffer reordered segments (i.e. with gaps)

[Figure: sender buffer]

Page 22:

TCP congestion control: a loop to stabilize (bring it to equilibrium and keep it there)

• Diagrams taken from: V. Jacobson, K. Nichols, K. Poduri, "RED in a Different Light", unpublished
  – "...because someone on the program committee was offended by the diagram explaining its behavior."
    https://gettys.wordpress.com/2010/12/17/red-in-a-different-light/

Page 23:

TCP Congestion Control: Tahoe

• Distinguish:
  – flow control: protect the receiver against overload
    (receiver "grants" a certain amount of data ("receiver window" (rwnd)))
  – congestion control: protect the network against overload
    ("congestion window" (cwnd) limits the rate: min(cwnd, rwnd) is used!)

• Flow/Congestion Control combined in TCP. Two basic algorithms:
  – Slow Start: start with an Initial Window (typically 3 packets). For each ACK received, increase cwnd by 1 packet (exponential growth) until cwnd >= the ssthresh threshold
  – Congestion Avoidance: each RTT, increase cwnd by at most 1 packet (linear growth - "additive increase")
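The two growth modes above, as a minimal sketch in segment units (one update per incoming ACK; real stacks count bytes):

```python
INITIAL_WINDOW = 3   # segments, as on the slide

def on_ack(cwnd, ssthresh):
    """cwnd update per incoming ACK."""
    if cwnd < ssthresh:
        return cwnd + 1          # Slow Start: +1 per ACK -> doubles each RTT
    return cwnd + 1.0 / cwnd     # Congestion Avoidance: ~+1 segment per RTT

cwnd, ssthresh = float(INITIAL_WINDOW), 16
trace = []
for _ in range(16):
    cwnd = on_ack(cwnd, ssthresh)
    trace.append(round(cwnd, 2))
print(trace)   # 4, 5, ..., 16 (exponential per RTT), then 16.06, 16.12, ...
```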

Page 24:

TCP Congestion Control: Tahoe /2

[Figure: bandwidth (cwnd) over time – exponential Slow Start up to ssthresh, then linear Congestion Avoidance; on Timeout, cwnd drops back to 1 and Slow Start restarts]

• If a packet or ACK is lost (timeout), set cwnd = 1, ssthresh = cwnd/2 ("multiplicative decrease") - exponential backoff
  – Note: no strong reason for the value 2 here

• Actually, "FlightSize/2" instead of cwnd/2, because cwnd might not always be fully used

Page 25:

Background: AIMD

[Figure 2.2: Vector representations of distributed linear control algorithms – user allocations x1 and x2, Fairness Line, Efficiency Line, starting point and desirable point; trajectories of AIAD, MIMD and AIMD]

there (is stable). It is easy to see that this control is unrealistic for binary feedback: provided that both users get the same feedback at any time, there is no way for user 1 to interpret the information "there is congestion" or "there is no congestion" differently than user 2 — but the Desirable vector has a negative x1 component and a positive x2 component.

Adding a constant positive or negative factor to a value at the same time corresponds to moving along at a 45° angle. This effect is produced by AIAD: both users start at a point underneath the Efficiency Line and move upward at an angle of 45°. The system ends up in an overloaded state (the state transition vector passes the Efficiency Line), which means that it now sends the feedback "there is congestion" to the users. Next, both users decrease their load by a constant factor, moving back along the same line. With AIAD, there is no way for the system to leave this line.

The same is true for MIMD, but here, a multiplication by a constant factor corresponds with moving along an Equi-Fairness Line. AIMD actually approaches perfect fairness and efficiency, but due to the binary nature of the feedback, the system can only converge to an equilibrium instead of a stable point — it will eventually fluctuate around the optimum. Note that these are by
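The excerpt's argument can be checked with a toy simulation: two users adjusting load on shared binary congestion feedback. The capacity, step sizes and starting points below are illustrative, not from the text.

```python
CAPACITY = 10.0

def simulate(update, x1=1.0, x2=6.0, steps=200):
    """Both users see the same binary feedback and apply the same rule."""
    for _ in range(steps):
        congested = (x1 + x2) > CAPACITY
        x1, x2 = update(x1, congested), update(x2, congested)
    return x1, x2

aimd = lambda x, c: x * 0.5 if c else x + 0.1   # multiplicative decrease
aiad = lambda x, c: x - 0.1 if c else x + 0.1   # additive decrease

a1, a2 = simulate(aimd)
print(round(a1 / a2, 2))   # close to 1: AIMD converges toward fairness
b1, b2 = simulate(aiad)
print(round(b2 - b1, 2))   # 5.0: AIAD preserves the initial unfairness forever
```

Each multiplicative decrease halves the difference between the two users, which is why AIMD drifts toward the fair point while AIAD just slides along its 45° line.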

Page 26:

AIMD is an easy way out, but...

Michael Welzl, Max Mühlhäuser: "CAVT - A Congestion Avoidance Visualization Tool", ACM SIGCOMM CCR 33(3), July 2003, pp. 95-101.
http://heim.ifi.uio.no/michawe/research/tools/cavt/index.html

Page 27:

Actual TCP (simulation)

Page 28:

Fast Retransmit / Fast Recovery (FR/FR) (Reno)

Reasoning: slow start = restart; assume that the network is empty. But even similar incoming ACKs indicate that packets arrive at the receiver! Thus, the slow start reaction = too conservative.

1. Upon reception of the third duplicate ACK (DupACK): ssthresh = FlightSize/2

2. Retransmit the lost segment (fast retransmit); cwnd = ssthresh + 3*SMSS ("inflates" cwnd by the number of segments (three) that have left the network and which the receiver has buffered)

3. For each additional DupACK received: cwnd += SMSS (inflates cwnd to reflect the additional segment that has left the network)

4. Transmit a segment, if allowed by the new values of cwnd and rwnd

5. Upon reception of an ACK that acknowledges new data ("full ACK"): "deflate" the window: cwnd = ssthresh (the value set in step 1)

[Figure: bandwidth (cwnd) over time – Slow Start, then Congestion Avoidance; with FR/FR, the window is halved and Congestion Avoidance continues instead of restarting with Slow Start]

Remember: goal is to quickly + correctly reach ssthresh
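The five numbered steps can be sketched as state transitions, counting cwnd in SMSS units (a minimal sketch, not a full TCP implementation):

```python
class RenoFastRecovery:
    """Window inflation/deflation during FR/FR, in segment (SMSS) units."""
    def __init__(self, flight_size):
        self.flight_size = flight_size

    def on_third_dupack(self):
        self.ssthresh = self.flight_size // 2   # step 1: halve
        self.cwnd = self.ssthresh + 3           # step 2: fast retransmit;
                                                # inflate by the 3 DupACKed segments

    def on_additional_dupack(self):
        self.cwnd += 1                          # step 3: one more segment left the network

    def on_full_ack(self):
        self.cwnd = self.ssthresh               # step 5: deflate the window
        return self.cwnd

fr = RenoFastRecovery(flight_size=20)
fr.on_third_dupack()         # ssthresh = 10, cwnd = 13
fr.on_additional_dupack()    # cwnd = 14: may allow a new transmission (step 4)
print(fr.on_full_ack())      # 10: continue in Congestion Avoidance at ssthresh
```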

Page 29:

Multiple dropped segments

• Sender cannot detect the loss of multiple segments from a single window
  – Insufficient information in DupACKs

• NewReno:
  – stay in FR/FR when a partial ACK arrives after DupACKs
  – retransmit a single segment
  – only a full ACK ends the process

• Important to obtain enough ACKs to avoid a timeout
  – Limited Transmit: also send new segments for the first two DupACKs
  – Early Retransmit: resend old data if there's not enough to send

[Figure 3.7: A sequence of events leading to Fast Retransmit/Fast Recovery – segments 1-5 are sent, segment 1 is lost; each arriving segment triggers a duplicate "ACK 1", and the third DupACK triggers FR/FR]

actually made it to the receiver arrives. This is the ACK that brings the sender out of fast retransmit/fast recovery mode, and it is caused by the retransmitted segment 1. While this ACK would ideally acknowledge the reception of segments 2 to 5, it will be an 'ACK 3' in the scenario shown in Figure 3.7. This ACK, which covers some but not all of the segments that were sent before entering fast retransmit/fast recovery, is called a partial ACK.

Segment 3 will be retransmitted if another three DupACKs arrive and fast retransmit/fast recovery is triggered again. The requirement for three incoming DupACKs in response to a single lost segment is problematic at this point. Consider what happens if the advertised window is 10 segments, cwnd is large enough to transmit all of them, and every other segment in flight is dropped. For all these segments to be recovered using fast retransmit/fast recovery, a total of 15 DupACKs would have to arrive. Since DupACKs are generated only when segments arrive at the receiver, the sender will not be able to send enough segments and reach a point where it waits in vain for DupACKs to arrive. Then, the RTO timer will expire, which means that the sender will enter slow start mode.

This is undesirable because it renders the connection unnecessarily inefficient: expiry of the RTO timer should normally indicate that the 'pipe' has emptied, but this is not the case here – it is just not as full as it would be if only a single segment was dropped from the window. The problem is aggravated by the fact that ssthresh is probably very small (e.g. if it was possible to enter fast retransmit/fast recovery several times in a row as described in (Floyd 1994), ssthresh would be halved each time). Researchers have put significant

Page 30:

Selective ACKnowledgements (SACK)

• Example on the NewReno slide: send ACK 1, SACK 3, SACK 5 in response to segment #4

• Better sender reaction possible
  – (New)Reno can only retransmit 1 segment / window, SACK can retransmit more
  – Particularly advantageous when the window is large (long fat pipes)

• Extension: DSACK informs the sender of duplicate arrivals
• Reaction to SACK open to the implementer, but must follow general CC rules

• Next: IETF-recommended "conservative" algorithm (RFC 6675)
• Other variant: FACK, optional in Linux; considers all "holes" as lost → retransmit

Page 31:

SACK loss recovery: definitions

• HighData: highest seqno transmitted
• HighACK: seqno of highest cumulatively ACKed byte
• HighRxt: highest seqno retransmitted in this loss recovery phase
• Pipe: sender's estimate of the number of bytes outstanding in the network. Key variable for congestion control, because now this number is explicit.
• DupAcks: # DupACKs received since last cumulative ACK
  (DupACK = segment containing a SACK block that identifies previously unacked and un-SACKed bytes between HighACK and HighData)
• DupThresh: # DupACKs needed to trigger retransmission (normally 3)
• Scoreboard: data structure to keep track of sequence number ranges
• RescueRxt: highest seqno which has been optimistically retransmitted to prevent stalling of the ACK clock when there is loss at the end of the window and no new data is available for transmission.

Page 32:

SACK loss recovery: scoreboard functions

• Update()
  – Mark all cumulatively ACKed or SACKed bytes, record total # SACKed bytes

• IsLost(SeqNo)
  – True when either DupThresh discontiguous SACKed sequences have arrived above SeqNo, or more than (DupThresh - 1) * SMSS bytes with sequence no's > SeqNo have been SACKed

• SetPipe()
  – pipe = 0
  – for (S1 = HighACK; S1 <= HighData; S1++)
    • if (scoreboard[S1] unsacked)
      – if !IsLost(S1): pipe++    // not SACKed, not lost → in flight
      – if S1 <= HighRxt: pipe++  // retransmitted → counted again as in flight

[Diagram: segments 1 2 X 4 5 X 7 8 – HighACK after 2, HighRxt at the retransmitted segment 3, HighData at 8; IsLost(3) = true; pipe++ for segment 6]
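The two scoreboard functions above, reduced to whole-segment units (so the two IsLost clauses coincide; RFC 6675 itself works on byte ranges):

```python
DUP_THRESH = 3

def is_lost(seq, sacked):
    """True if DupThresh SACKed segments have arrived above seq."""
    return sum(1 for s in sacked if s > seq) >= DUP_THRESH

def set_pipe(high_ack, high_data, high_rxt, sacked):
    """Estimate the number of segments currently in flight."""
    pipe = 0
    for s1 in range(high_ack + 1, high_data + 1):
        if s1 in sacked:
            continue                  # SACKed: known to have arrived
        if not is_lost(s1, sacked):
            pipe += 1                 # not SACKed, not lost -> in flight
        if s1 <= high_rxt:
            pipe += 1                 # retransmitted -> in flight again
    return pipe

# The slide's diagram: 1 2 X 4 5 X 7 8, with HighACK = 2, HighRxt = 3
# (the retransmitted segment), HighData = 8, SACKed = {4, 5, 7, 8}:
sacked = {4, 5, 7, 8}
print(is_lost(3, sacked))             # True: four SACKed segments above 3
print(is_lost(6, sacked))             # False: only 7 and 8 above 6
print(set_pipe(2, 8, 3, sacked))      # 2: the retransmit of 3, plus 6
```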

Page 33:

SACK loss recovery: scoreboard functions /2

• NextSeg() (what to transmit next)
  1. If there exists S2 such that:
     a) S2 > HighRxt
     b) S2 < highest byte covered by any received SACK
     c) IsLost(S2) == true
     ...return a max. SMSS sequence range starting with S2
  2. Else, if there is unsent data and rwnd allows, return a max. SMSS sequence range starting with HighData+1
  3. Else, if there exists S3 for which 1a) and 1b) are true, return one segment of max. SMSS bytes starting with S3
  4. Else, if HighACK > RescueRxt (or RescueRxt undefined), return one segment of max. SMSS bytes that includes the highest outstanding unSACKed seqno, and set RescueRxt to RecoveryPoint (HighData). Do not update HighRxt.
  5. Else fail (nothing returned)

• Rules 3 and 4 are a retransmission "last resort"

Page 34:

When an ACK with SACK info arrives...

• Run Update()
• Cumulative ACK? If so, DupAcks = 0
• DupACK? If so, and not in FR yet:
  – DupAcks++
  1. If DupAcks >= DupThresh, goto (4)
  2. If DupAcks < DupThresh but IsLost(HighACK + 1), goto (4)
  3. Send new segments (Limited Transmit):
     1. Set HighRxt to HighACK
     2. Run SetPipe()
     3. If (cwnd - pipe) >= 1 SMSS, there exists previously unsent data, and rwnd allows: transmit up to 1 SMSS of data starting with the byte HighData+1 and update HighData to reflect this transmission, then return to (3.2)
  4. Terminate processing of this ACK

Page 35:

...cont'd: entering FR/FR (step 4)

1. RecoveryPoint = HighData
2. ssthresh = cwnd = (FlightSize / 2)
   (Segments sent as part of Limited Transmit are not counted in FlightSize)
3. Retransmit the first data segment presumed dropped: the segment starting with sequence number HighACK + 1. To prevent repeated retransmission of the same data or a premature rescue retransmission, set both HighRxt and RescueRxt to the highest sequence number in the retransmitted segment.
4. Run SetPipe()
5. In order to take advantage of potential additional available cwnd, proceed to the transmission step of the FR algorithm (next)

Also upon timeout!

Page 36:

FR algorithm

• ACK arrives: Cumulative ACK for seqno > RecoveryPoint?
  – yes: end FR; keep scoreboard info above HighACK
  – no: run Update() and SetPipe()

• cwnd - pipe >= 1 SMSS? Transmit segments!
  1. Send based on NextSeg() (stop this if NextSeg() fails)
  2. If any of the bytes sent in 1) are below HighData, set HighRxt to the highest sequence number of the retransmitted segment, unless NextSeg() rule (4) was invoked for this retransmission (RescueRxt)
  3. If any of the bytes sent in 1) are above HighData, update HighData
  4. pipe += bytes transmitted in 1)
  5. If cwnd - pipe >= 1 SMSS, return to 1)

Page 37:

Spurious timeouts

• Possible occurrence in e.g. wireless scenarios (handover): sudden delay spike

• Can lead to timeout → slow start
  – But the underlying assumption "pipe empty" is wrong! ("spurious timeout")
  – An old incoming ACK after the timeout should be used to undo the error

• Several methods proposed. Examples:
  – Eifel Algorithm: use the timestamps option to check: timestamp in ACK < time of timeout?
  – DSACK: duplicate arrived
  – F-RTO: after the RTO, send one retransmit; then, if an ACK advances the window, send new data; if the new data gets ACKed, the timeout was spurious

Page 38:

Appropriate Byte Counting

• Increasing in Congestion Avoidance mode: common implementation (e.g. Jan '05 FreeBSD code): cwnd += SMSS*SMSS/cwnd for every ACK (same as cwnd += 1/cwnd if we count segments)
  – Problem: e.g. cwnd = 2: 2 + 1/2 + 1/(2 + 1/2) = 2 + 0.5 + 0.4 = 2.9; thus, cannot send a new packet after 1 RTT
  – Worse with delayed ACKs (cwnd = 2.5)
  – Even worse with ACKs for less than 1 segment (consider 1000 1-byte ACKs) → too aggressive!

• Solution: Appropriate Byte Counting (ABC)
  – Maintain a bytes_acked variable; send a segment when a threshold is exceeded
  – Works in Congestion Avoidance; but what about Slow Start?
    • Here, ABC + delayed ACKs means that the rate increases in 2*SMSS steps
    • If a series of ACKs is dropped, this could be a significant burst ("micro-burstiness"); thus, a limit of 2*SMSS per ACK is recommended
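The Congestion Avoidance side of ABC can be sketched as follows: instead of growing cwnd a little on every ACK, count the bytes actually acknowledged and add one SMSS only when a full cwnd's worth has been ACKed. A minimal sketch, not RFC 3465's complete rules (the Slow Start limit L is omitted).

```python
SMSS = 1460   # bytes

def abc_ca_ack(state, newly_acked):
    """Congestion Avoidance with Appropriate Byte Counting."""
    state["bytes_acked"] += newly_acked
    if state["bytes_acked"] >= state["cwnd"]:
        state["bytes_acked"] -= state["cwnd"]
        state["cwnd"] += SMSS          # +1 segment per cwnd bytes ACKed
    return state["cwnd"]

# A delayed ACK covering 2 segments: a full 2-segment window ACKed -> +1 SMSS.
state = {"cwnd": 2 * SMSS, "bytes_acked": 0}
print(abc_ca_ack(state, 2 * SMSS) // SMSS)   # 3

# 1000 1-byte ACKs: the old per-ACK rule would grow cwnd 1000 times;
# with ABC they only accumulate 1000 bytes of credit, far below cwnd:
state = {"cwnd": 2 * SMSS, "bytes_acked": 0}
for _ in range(1000):
    abc_ca_ack(state, 1)
print(state["cwnd"] // SMSS)                 # still 2
```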

Page 39:

Maintaining congestion state

• TCP Control Block (TCB): information such as RTO, scoreboard, cwnd, ..
  – Related to the network path, yet separately stored per TCP connection
  – Compare: the layering problem of PMTU storage

• TCB interdependence: affects the initialization phase
  – Temporal sharing: learn from a previous connection (e.g. for consecutive HTTP requests)
  – Ensemble sharing: learn from existing connections; here, some information should change - e.g. cwnd could be cwnd/n, n = number of connections; but less aggressive than the "old" implementation
  – Known implementations today only share MSS and RTT info

• Related idea: Congestion Manager
  – One entity in the OS maintains all the congestion-control-related state
  – Used by TCP and by UDP-based applications
  – Hard to implement, not really used

draft-touch-tcpm-2140bis-02.txt

Page 40:

End of part 1...

• Note, we focused on what was implemented
  – and what is still in code, from that time
  – We also skipped some things: header compression, authentication, sequence number attacks, implementation specifics (e.g. TCP_NOTSENT_LOWAT)...

• The RFC series documents many non-implemented (??) ideas from that time
  – Being more robust to reordering (TCP-NCR)
  – Doing congestion control for ACKs (ACK-CC)
  – Reducing cwnd when the sender doesn't have data to send (CWV)
  – Adjusting the user timeout (when TCP says "it's over") at both ends (UTO)
  – Avoiding Slow Start overshoot for large windows (Limited Slow-Start)

• ...and some old ideas "took off" later (T/TCP, ECN)...


TCP RFCs, status of October 2007

Standards-track TCP RFCs that influence when a packet is sent


TCP, part 2:

Newer history (since ~2000s) with tech details

("reality check": what do modern TCP implementations look like)


Two needs, one protocol

• Two main requirements have been pushing TCP development ahead over the last 15 years or so:

1. The need to scale to faster networks (large Bandwidth × Delay Product (BDP))

2. The need to reduce latency
– Often, for short flows
– Google's need...


TCP with High-Speed links
• TCP over "long fat pipes": large bandwidth*delay product
– long time to reach equilibrium, MD = problematic
– from RFC 3649 (HighSpeed RFC, Experimental):

For example, for a Standard TCP connection with 1500-byte packets and a 100 ms round-trip time, achieving a steady-state throughput of 10 Gbps would require an average congestion window of 83,333 segments, and a packet drop rate of at most one congestion event every 5,000,000,000 packets (or equivalently, at most one congestion event every 1 2/3 hours). This is widely acknowledged as an unrealistic constraint.

[Figure: AIMD sawtooth on a slow link (area: 3 ct) vs. a fast link (area: 6 ct). Theoretically, utilization is independent of capacity – but convergence time is longer.]


Slow convergence animation

[Animation: flows converging on a slow link vs. a fast link]


Some proposed solutions
• Standards: window scaling option, TCP SACK

• Scalable TCP: increase/decrease functions changed
– cwnd := cwnd + 0.01 for each ACK received while not in loss recovery
– cwnd := 0.875 * cwnd on each loss event
(probing times proportional to RTT but not to rate)

[Figure: cwnd over time for Standard TCP vs. Scalable TCP. Source: http://www.deneholme.net/tom/scalable/]
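Scalable TCP's fixed per-ACK and per-loss constants can be written down directly (a sketch; names are illustrative):

```python
# Sketch of Scalable TCP's MIMD update rules.

def scalable_ack(cwnd):
    """Per-ACK increase outside loss recovery: cwnd += 0.01."""
    return cwnd + 0.01

def scalable_loss(cwnd):
    """Multiplicative decrease on each loss event: cwnd *= 0.875."""
    return 0.875 * cwnd

# Recovery time is proportional to the RTT but not to the rate:
# regaining the lost 12.5% takes ~0.0125*cwnd/0.01 ACKs, i.e. about
# 12.5 round-trip times regardless of how large cwnd is.
cwnd = 100.0
cwnd = scalable_loss(cwnd)                      # 87.5
acks_needed = round((100.0 - cwnd) / 0.01)
print(acks_needed)                              # 1250 ACKs ~ 12.5 RTTs
```

This flat, RTT-proportional recovery is exactly what the table on the next slide shows: the Scalable TCP column stays at 2.7 s across all rates.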


Some proposed solutions /2

Rate      Standard TCP recovery time  Scalable TCP recovery time
1 Mbps    1.7 s                       2.7 s
10 Mbps   17 s                        2.7 s
100 Mbps  2 mins                      2.7 s
1 Gbps    28 mins                     2.7 s
10 Gbps   4 hrs 43 mins               2.7 s

• HighSpeed TCP (RFC 3649 includes the Scalable TCP discussion):
– response function includes a(cwnd) and b(cwnd), which also depend on the loss ratio
– less drastic only in high-bandwidth environments with little loss

• TCP Westwood(+)
– different congestion response function (proportional to ACK-rate instead of b = 1/2)
– Proven to be stable


Some proposed solutions /3
• FAST TCP
– Variant based on window and delay
– Delay allows for earlier adaptation (awareness of a growing queue)
– Proven to be stable
– Commercially announced + patent protected, by Steven Low's CalTech group
– another delay-based example: TCP Vegas
• Vegas = impractical because less aggressive than standard TCP
• More recently: LEDBAT turns this into a benefit

• BIC, CUBIC
– BIC (Binary InCrease TCP) uses binary search to find the ideal window size:
• when loss occurs, current window = max, new window = min
• check the midpoint
• if no loss → new min, increase; else new window = new max
– CUBIC began as BIC++, uses a cubic increase function + much more...
• Backs off to FlightSize * 0.7 (previously FlightSize * 0.8) instead of 0.5
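A toy illustration of BIC's binary search over the window (heavily simplified; real BIC adds max-probing, increment clamps, and more):

```python
# Sketch of BIC's binary-search window growth.

def bic_next_target(w_min, w_max):
    """The midpoint that BIC probes next."""
    return (w_min + w_max) / 2

def bic_on_feedback(w_min, w_max, loss):
    """One search step: on loss the midpoint becomes the new max,
    otherwise it becomes the new min."""
    mid = bic_next_target(w_min, w_max)
    if loss:
        return w_min, mid
    return mid, w_max

# Searching between min=50 and max=100 with a hypothetical feedback
# sequence quickly narrows in on the ideal window:
w_min, w_max = 50.0, 100.0
for loss in (False, True, False):
    w_min, w_max = bic_on_feedback(w_min, w_max, loss)
print(w_min, w_max)   # 81.25 87.5
```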


And the IETF?
• HighSpeed TCP was an RFC; this made it seem okay to abandon "TCP-friendliness"
– Used to be a strong CC requirement: unresponsive traffic easily "kills" TCP → congestion collapse!
– ... but HighSpeed TCP becomes TCP-friendly when loss is high, so no collapse

• Experimental CCs available in Linux...
– ... and some were brought to the IETF (Cubic, HTCP, Compound TCP)
– But the IETF TCPM WG was (and still is) quite busy

• The IRTF Internet Congestion Control Research Group (ICCRG) was created to "push proposals over the fence" for a while, until they're ready
– This is still how this group operates


Meanwhile, in the real world…



CC implementations are diverging

• BIC became the default TCP CC in Linux in mid-2004
– Later replaced with CUBIC (less aggressive than BIC)

• ... which now also contains Hystart: avoids Slow-Start overshoot
– CUBIC Internet-draft recently picked up by other authors, soon RFC

• Note: before doing isolated experiments, look up "Fast Convergence"!

• Compound TCP (CTCP) = default TCP CC in Windows Server 2008
– For testing purposes; disabled by default in the standard release

• Ongoing Google testing with YouTube since 2017: BBR ("Bottleneck Bandwidth and Round-trip time") CC
– Paced, related to ACK-rate
– When saturated, estimate max BW; when unsaturated, estimate min RTT


Pacing
• "Micro-burstiness" can lead to packet drops

• Generally, the packet gap is dictated by the bottleneck; but the incoming stream at the bottleneck can be bursty (e.g. from slow start)

• Pacing is hard at high speeds (clock granularity)

• Various solutions
– in-network vs. end system
– Burst control mechanisms in Linux
– "Gap frames" that are later dropped by a link layer device
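The basic pacing computation is just the bottleneck-dictated inter-packet gap (a sketch; the `gain` parameter is an assumption, mirroring the pacing gain that paced stacks such as BBR apply):

```python
# Minimal pacing sketch: space packets at rtt/cwnd instead of sending
# the whole window back to back.

def pacing_gap(rtt, cwnd_segments, gain=1.0):
    """Inter-packet gap so that cwnd segments cover one (gained) RTT."""
    return rtt / (cwnd_segments * gain)

# cwnd = 10 segments, RTT = 100 ms -> one packet every 10 ms
print(round(pacing_gap(0.1, 10), 4))   # 0.01
```

The clock-granularity problem from the slide shows up immediately: at 10 Gbps with 1500-byte packets the gap shrinks to about 1.2 µs, below what many OS timers can reliably schedule.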


Proportional Rate Reduction (PRR)
• Generalization (any back-off factor) of Linux' rate-halving
– Rate-halving avoids the burst + pause behavior of the FACK or RFC 6675 "conservative loss recovery" algorithms; "paces" segments
– Implements, for FR, the common logic: Slow Start when cwnd < ssthresh

Example from RFC 6937 (X = lost, N = new, R = retransmit):

RFC 6675:
ack#:  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
cwnd: 20 20 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
pipe: 19 19 18 18 17 16 15 14 13 12 11 10 10 10 10 10 10 10 10
sent:  N  N  R  N  N  N  N  N  N  N  N

Rate-Halving (Linux):
ack#:  X  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
cwnd: 20 20 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11
pipe: 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10
sent:  N  N  R  N  N  N  N  N  N  N  N
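The proportional core of PRR can be sketched as follows (per RFC 6937's sndcnt rule for the pipe > ssthresh case; variable names are illustrative):

```python
# Sketch of PRR's proportional rule (RFC 6937): spread the window
# reduction over the recovery episode instead of halting transmission.

import math

def prr_sndcnt(prr_delivered, prr_out, ssthresh, recover_fs):
    """Segments we may send now so that, by the end of recovery,
    prr_out converges to ssthresh (as prr_delivered -> RecoverFS)."""
    return max(0, math.ceil(prr_delivered * ssthresh / recover_fs) - prr_out)

# Backing off to ssthresh=10 with 20 packets outstanding at the loss:
# roughly one new segment per two deliveries, i.e. "rate halving".
sent = 0
for delivered in range(1, 21):
    sent += prr_sndcnt(delivered, sent, ssthresh=10, recover_fs=20)
print(sent)   # 10
```

With a milder back-off factor (e.g. CUBIC's 0.7), ssthresh is simply larger and the same rule spreads a smaller reduction over the episode.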


Active Queue Management (AQM)
• Monitor the queue, not only drop upon overflow → more intelligent decisions
• Goals: keep avg. queue low, eliminate phase effects, "punish" aggressive flows

– Aggressive flows have more packets in the queue → dropping randomly is more likely to affect such flows
– Also possible to differentiate traffic (for QoS, e.g. RIO) via drop function(s)

• Old, very well known mechanism: Random Early Detection (RED)
– Drop decision based on: Qavg = (1 - Wq) x Qavg + Qinst x Wq
(Qavg = average occupancy, Qinst = instantaneous occupancy, Wq = weight – hard to tune, determines how aggressively RED behaves)

[Figure: drop-probability curves for RED and RED in "gentle" mode]

Recent AQMs: from the "BufferBloat" community: (FQ-)CoDel; from Cisco: PIE
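A sketch of RED's averaging and drop decision (the threshold and weight values are illustrative, not recommended settings):

```python
# Sketch of RED: EWMA queue averaging plus a linear drop-probability
# ramp between min_th and max_th.

import random

W_Q = 0.002                  # averaging weight (hard to tune in practice)
MIN_TH, MAX_TH = 5.0, 15.0   # thresholds on the *average* queue (packets)
MAX_P = 0.1                  # drop probability reached at max_th

def red_update_avg(q_avg, q_inst):
    """Qavg = (1 - Wq) * Qavg + Wq * Qinst"""
    return (1 - W_Q) * q_avg + W_Q * q_inst

def red_drop_probability(q_avg):
    if q_avg < MIN_TH:
        return 0.0
    if q_avg >= MAX_TH:
        return 1.0
    return MAX_P * (q_avg - MIN_TH) / (MAX_TH - MIN_TH)

def red_enqueue(q_avg, q_inst, rng=random.random):
    """Update the average and decide whether to drop this arrival."""
    q_avg = red_update_avg(q_avg, q_inst)
    drop = rng() < red_drop_probability(q_avg)
    return q_avg, drop

print(red_drop_probability(10.0))   # 0.05, halfway up the ramp
```

Because drops are random, flows with more packets in the queue are hit more often, which is exactly the "punish aggressive flows" goal above; "gentle" mode replaces the jump to 1.0 at max_th with a second linear ramp.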


Explicit Congestion Notification (ECN)

• Instead of dropping, set a bit
– Receiver informs sender about the bit
– Sender acts as if a packet had been dropped

Actual communication between routers and hosts!

• Deployment problems: ECN useless without AQM, AQMs better with ECN, old broken equipment...
– Recently, Apple started to enable ECN, to make the first step

• Alternative Backoff with ECN (ABE)
– with modern AQMs, the ECN signal not only means "congestion", but also "the queue was short"
– backing off by *0.5 is most probably too conservative; can e.g. use *0.8 in response to ECN only. A short queue is no longer a trade-off!

RFC 8087, draft-ietf-tcpm-alternativebackoff-ecn
N. Khademi, G. Armitage, M. Welzl, S. Zander, G. Fairhurst, D. Ros: "Alternative Backoff: Achieving Low Latency and High Throughput with ECN and AQM", IFIP NETWORKING 2017 (best paper award).
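The ABE idea reduces to choosing the multiplicative-decrease factor by signal type (a sketch; 0.8 is the example factor from the slide, and the specification allows other values):

```python
# Sketch of Alternative Backoff with ECN (ABE): a milder multiplicative
# decrease for ECN marks than for packet loss.

BETA_LOSS = 0.5   # conventional back-off on packet loss
BETA_ECN = 0.8    # milder back-off on an ECN mark (queue known short)

def backoff(cwnd, ecn_marked):
    return cwnd * (BETA_ECN if ecn_marked else BETA_LOSS)

print(backoff(100.0, ecn_marked=False))          # 50.0
print(round(backoff(100.0, ecn_marked=True), 1))  # 80.0
```

The rationale: an ECN mark from a modern AQM arrives while the queue is still short, so giving back half the window sacrifices throughput without buying any extra latency.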


ECN in action

• A nonce (against a lying receiver) was provided by a bit combination:
– ECT(0): ECT=1, CE=0; ECT(1): ECT=0, CE=1

• Never used... now ECT(1) is officially free to use for experiments
– One of them: L4S, uses different feedback and instantaneous queues
– Datacenter TCP (DCTCP) idea: 2 out of 10 packets marked → 1/5 congested

[Figure: ECN in action – data packets flow from sender to receiver through a congested router, ACKs flow back:
1. Sender sends packet with ECT=1, CE=0, nonce=random
2. Congested router: ECT=1, so don't drop; update CE=1, nonce=0
3. Receiver sets ECE=1 in subsequent ACKs, even if CE=0
4. Sender reduces cwnd, sets CWR=1
5. Receiver only sets ECE=1 in ACKs again when CE=1]


Next: measures to help short flows

• Short flows are often interactive; latency matters
– Large bulk data transfers are not usually latency-critical

• Every packet matters: drop → retransmit → user-perceived latency
– Good to send much, fast (speed up slow start)
– Shaving off round-trips: when all the data can be sent in e.g. 1 RTT, handshake latency = ½ of the time
– Tail loss: FR can't work when there is no more data to send, hence no more ACKs arrive

• Note: short flows ≈ application-limited flows ("thin streams") (also: rwnd-limited flows!)


Increasing the Initial Window (IW)
• Slow start: 3 RTTs for 3 packets = inefficient for very short transfers
– Example: HTTP requests

• Thus, initial window since ~2002: IW = min(4*MSS, max(2*MSS, 4380 byte)) (typically 3)

• Since ~2013: IW = min(10*MSS, max(2*MSS, 14600)) (typically 10)
– Adopted in Linux as default since kernel 2.6.39 (May 2011)
– Note: cwnd after a timeout ("Loss Window" (LW)) is still 1
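The two IW formulas as code (byte constants as given above; these correspond to the RFC 3390 and RFC 6928 rules):

```python
# The initial-window formulas from the slide.

def iw_2002(mss):
    """IW = min(4*MSS, max(2*MSS, 4380 byte))"""
    return min(4 * mss, max(2 * mss, 4380))

def iw_2013(mss):
    """IW = min(10*MSS, max(2*MSS, 14600 byte))"""
    return min(10 * mss, max(2 * mss, 14600))

mss = 1460
print(iw_2002(mss) // mss)   # 3 segments
print(iw_2013(mss) // mss)   # 10 segments
```

The max(2*MSS, ...) clause matters for unusual MSS values: with a very large MSS the window still starts at two segments rather than one.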

[Figure 3.5: Slow start (a) and congestion avoidance (b) – sender/receiver timelines showing segments 0..6 and the corresponding ACKs]

exactly one segment per RTT in congestion avoidance as in this diagram is an unrealistic simplification. Theoretically, the 'Multiplicative Decrease' part of the congestion avoidance algorithm comes into play when the RTO timer expires: this is taken as a sign of congestion, and cwnd is halved. Just like the additive increase strategy, this differs substantially from slow start – yet, both algorithms have their justification and should somehow be included in TCP.

3.4.2 Combining the algorithms

In order to realize both slow start and congestion avoidance, the two algorithms were merged into a single congestion control mechanism, which is implemented at the sender as follows:

• Keep the cwnd variable (initialized to one segment) and a threshold size variable by the name of ssthresh. The latter variable, which may be arbitrarily high at the beginning according to RFC 2581 (Allman et al. 1999b) but is often set to 64 kB, is used to switch between the two algorithms.

• Always limit the amount of segments that are sent with the minimum of the advertised window and cwnd.

• Upon reception of an ACK, increase cwnd by one segment if it is smaller than ssthresh; otherwise increase it by MSS ∗ MSS/cwnd.


TCP Fast Open (TFO)

• Builds on the T/TCP idea: allow HTTP GET on SYN, respond with data + SYN/ACK
– There, the problem was: DoS attack surface

• Solution:
– First handshake like normal; the server gives the client a cookie and remembers it (locally configurable time)
– Later handshakes: SYN + data + cookie

• Remaining problem: the server cannot tell an original from a retransmitted SYN → the application must be able to accept duplicate data (changes semantics, also the API)
– Not a big problem for a web server


Tail loss

• Consider the "tail" of a transmission
– e.g., segments 8, 9, 10 of a total 10-segment transfer

• Segment 8 lost: we get 2 DupACKs
– If we have new data to send, Limited Transmit allows us to send it (which will give us another DupACK, and we can enter FR, where we can retransmit)
– Else, Early Retransmit allows us to resend segment 8

• Segment 10 lost: we get no more ACKs; only the RTO can help us...


RTO Restart (RTOR)

• In some cases TCP/SCTP must use the RTO for loss recovery
– e.g., if a connection has 2 outstanding packets and 1 is lost

• However, the effective RTO often becomes RTO = RTO + t
– where t ≈ RTT [+ delACK]

• The reason is that the timer is restarted on each incoming ACK (RFC 6298, RFC 4960)

• RTOR rearms the timer as: RTO = RTO - t

[Figure: sender/receiver timeline showing the effective timeout RTO + t with classic restarting, and RTO Restart firing t earlier]

Mohammad Rajiullah, Per Hurtig, Anna Brunstrom, Andreas Petlund, Michael Welzl, "An Evaluation of Tail Loss Recovery Mechanisms for TCP", ACM SIGCOMM CCR 45(1), January 2015.
RFC 7765
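RTOR's rearming rule as code (a sketch; names are illustrative):

```python
# Sketch of RTO Restart (RFC 7765): instead of rearming a full RTO on
# each incoming ACK, subtract the time the earliest outstanding packet
# has already spent in flight.

def classic_rearm(rto):
    """Restart a full RTO; the effective timeout becomes RTO + t."""
    return rto

def rtor_rearm(rto, now, earliest_sent_time):
    """Rearm as RTO - t, so the timer fires RTO after the earliest
    outstanding packet was sent."""
    t = now - earliest_sent_time    # t ~ RTT (+ delACK)
    return max(rto - t, 0.0)

# RTO = 1 s, oldest outstanding packet sent 0.2 s ago:
print(round(rtor_rearm(1.0, now=10.2, earliest_sent_time=10.0), 3))  # 0.8
```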


Tail Loss Probe (TLP)

• From draft-dukkipati-tcpm-tcp-loss-probe-01:
"Measurements on Google Web servers show that approximately 70% of retransmissions for Web transfers are sent after the RTO timer expires, while only 30% are handled by fast recovery."
"...distribution of RTO/RTT values on Google Web servers. [percentile, RTO/RTT]: [50th percentile, 4.3]; [75th percentile, 11.3]; [90th percentile, 28.9]; [95th percentile, 53.9]; [99th percentile, 214]."
"... typically caused by variance in measured RTTs..."

• Idea: a more aggressive timer allows sending one single packet ("probe") before the RTO fires
– timer: max(2 * SRTT, 10ms) (+ extra time for DelACK if FlightSize == 1)
– new data if available, else resend
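The probe timeout from the slide as code (a sketch; the 200 ms worst-case delayed-ACK allowance is an assumption, and the draft's exact formula has evolved):

```python
# Sketch of the Tail Loss Probe (TLP) timer.

def tlp_timeout(srtt, flight_size, wc_delack=0.2):
    """PTO = max(2*SRTT, 10 ms); with only one packet in flight,
    allow extra time for a delayed ACK."""
    pto = max(2 * srtt, 0.010)
    if flight_size == 1:
        pto += wc_delack
    return pto

print(tlp_timeout(srtt=0.05, flight_size=4))            # 0.1
print(round(tlp_timeout(srtt=0.05, flight_size=1), 3))  # 0.3
```

Since the PTO is far smaller than the RTO/RTT ratios quoted above (median 4.3), a tail loss is usually repaired one probe-timer instead of one full RTO later.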


Recent ACKnowledgment (RACK)

• Main idea: use time instead of sequence numbers (avoid basing logic on DupThresh)
– Multiple benefits: eliminates the need for much loss recovery logic (drastic simplification!), works with every packet (also retransmits), ..

• Packet A is lost if some packet B sent sufficiently later is (s)acked
– "Sufficiently later": later by at least a "reordering window" (RACK.reo_wnd, default min_rtt / 4)
– min_rtt calculated from RTTs per ACK; tried seeding with SRTT or the most recent RTT, no major difference

• Also: arm a timer to detect loss in case no ACK arrives
– TLP is a special case; merged with RACK
– Conceptually, RACK arms a (virtual) timer on every packet sent, times updated with new RTT samples

On by default in Linux!
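RACK's time-based loss rule can be sketched in a few lines (names are illustrative; real RACK also adapts the reordering window and arms the timer mentioned above):

```python
# Sketch of RACK's loss rule: packet A is considered lost if some
# packet B, sent sufficiently later than A, has been (s)acked.

def rack_reo_wnd(min_rtt):
    """Default reordering window: min_rtt / 4."""
    return min_rtt / 4

def rack_lost(sent_time_a, latest_acked_sent_time, min_rtt):
    """A is lost if an ACKed packet was sent > reo_wnd after A."""
    return latest_acked_sent_time - sent_time_a > rack_reo_wnd(min_rtt)

# min_rtt = 100 ms; P1 sent at t=0, a packet sent at t=0.05 was SACKed:
print(rack_lost(0.0, 0.05, min_rtt=0.1))   # True
```

Note that the rule works identically for retransmissions: a retransmitted segment gets a fresh send time, which is what makes the repeated-loss case on the next slide detectable.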


RACK examples: sender sends P1, P2, P3 (more than RACK.reo_wnd time in between them)

• Example 1: P1 and P3 lost
– P2 SACK arrives → P1 lost, retransmit (R1)
– R1 is cumulatively ACKed → P3 lost, retransmit (R3)
– No timer needed

• Example 2: P1 and P2 lost
– P3 SACK arrives → P1, P2 lost, retransmit (R1, R2)
– R1 lost again but R2 SACKed → R1 lost, retransmit
– Common with rate limiting from token bucket policers with large bucket depth and a low rate limit
– Retransmissions are often lost repeatedly because CC requires multiple RTTs to reduce the rate below the policed rate

• No DupACK-based solution can detect such losses!


Performance Enhancing Proxies (PEPs)

• Common; many ideas implemented... how? usually hidden(business secret of "speed up your network" boxes)– Figure: split connection approach: 2a / 2b instead of control loop 1– Many possibilities - e.g. Snoop TCP: monitor + buffer; in case of loss,

suppress DupACKs and retransmit from local buffer

• Proxies assume a certain TCP behavior + packet format– Inhibit further development ("ossification")


Multipath TCP (MPTCP)
• Many hosts are nowadays multihomed
– Smartphones (WiFi + 3G), data centers
– Why not use both connections at once?

• Cannot know where the bottleneck is
– If it is shared by the two connections, they should appear (be as aggressive) as only one connection
– MPTCP changes the congestion avoidance "increase" parameter to "divide" aggression accordingly
• but instead of being "½ TCP", coupled congestion control tries to send as much as possible over the least congested path
• Least congested = largest window achieved; hence, increase per path in proportion to the window
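The coupled increase described above (RFC 6356's "Linked Increases" algorithm) can be sketched in segment units (names are illustrative):

```python
# Sketch of MPTCP coupled congestion control (RFC 6356, "LIA"):
# per ACK on subflow i, cwnd_i += min(alpha / cwnd_total, 1 / cwnd_i).
# alpha caps the ensemble's aggressiveness while steering traffic
# toward the less congested (larger-window) paths.

def lia_alpha(cwnds, rtts):
    """alpha = cwnd_total * max_i(cwnd_i/rtt_i^2) / (sum_i cwnd_i/rtt_i)^2"""
    total = sum(cwnds)
    best = max(c / r ** 2 for c, r in zip(cwnds, rtts))
    return total * best / sum(c / r for c, r in zip(cwnds, rtts)) ** 2

def lia_increase(i, cwnds, rtts):
    """Per-ACK increase on subflow i, capped at regular TCP's 1/cwnd_i."""
    alpha = lia_alpha(cwnds, rtts)
    return min(alpha / sum(cwnds), 1 / cwnds[i])

# Two identical subflows (cwnd = 10 segments, RTT = 1 s):
print(lia_alpha([10.0, 10.0], [1.0, 1.0]))        # 0.5
print(lia_increase(0, [10.0, 10.0], [1.0, 1.0]))  # 0.025
```

With identical paths the coupled ensemble grows no faster than one TCP flow would; a subflow with a larger achieved window (less congested path) receives proportionally more of the increase.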


MPTCP /2

• Moving traffic away from congested links achieves "resource pooling"
– A web server connected to two 100 Mbit/s links behaves roughly as if it had one 200 Mbit/s link
– Only one host needs to be multi-homed

• Issues
– Must look like TCP to work everywhere; minimal on-the-wire changes: new TCP option
– Parallel paths can cause reordering → delay in handing over data to the application on the receiver side


Part 3:

The Stream Control Transmission Protocol (SCTP)

&QUIC


What is SCTP?

• Basically, a TCP++

• Evolved from, and is used for, IP telephony signaling (now also: SCTP/UDP as the WebRTC data channel)
– Like TCP: reliable, full-duplex connections
– Unlike TCP and UDP: new delivery options that are particularly desirable for telephony signaling and multimedia applications

• TCP + features
– Congestion control similar; some optional TCP mechanisms are mandatory
– Two basic types of enhancements: 1) Performance; 2) Robustness


SCTP services: SoA TCP + extras

Services/Features                          SCTP  TCP          UDP
Full-duplex data transmission              yes   yes          yes
Connection-oriented                        yes   yes          no
Reliable data transfer                     yes   yes          no
Unreliable data transfer                   yes   no           yes
Partially reliable data transfer           yes   no           no
Ordered data delivery                      yes   yes          no
Unordered data delivery                    yes   no           yes
Flow and Congestion Control                yes   yes          no
Selective ACKs                             yes   yes          no
Preservation of message boundaries (ALF)   yes   no           yes
PMTUD                                      yes   yes          no
Application data fragmentation             yes   yes          no
Multistreaming                             yes   no           no
Multihoming                                yes   yes (MPTCP)  no
Protection against SYN flooding attacks    yes   yes (TFO)    n/a

For a complete overview, see draft-ietf-taps-transports-usage-08!


Multihoming

• ...at the transport layer! (i.e. transparent for apps, such as FTP)

• TCP connection ↔ SCTP association
– 2 IP addresses, 2 port numbers ↔ 2 sets of IP addresses, 2 port numbers

• Primary goal: robustness
– automatically switch hosts upon failure
– eliminates the effect of long routing reconvergence times
– CMT (Concurrent Multipath Transport) never standardized...

• TCP: no "keepalive" messages when the connection is idle (but "legal" to implement)

• SCTP monitors reachability via ACKs of data chunks and heartbeat chunks


Packet format
• Unlike TCP, SCTP provides a message-oriented data delivery service
– key enabler for performance enhancements

• Common header; three basic functions:
– Source and destination ports together with the IP addresses
– Verification tag
– Checksum: CRC-32 instead of Adler-32

• followed by one or more chunks
– chunk header that identifies length, type, and any special flags
– concatenated building blocks containing either control or data information
– control chunks transfer information needed for association (connection) functionality; data chunks carry application layer data
– Current spec: 14 different control chunks for association establishment, termination, ACK, destination failure recovery, error reporting

• A packet can contain several different chunk types
• SCTP is extensible


Application Level Framing (ALF)

• Concept applied in SCTP and e.g. RTP
– Byte stream (TCP) inefficient when packets are lost
– Application may want logical data units ("chunks")

[Figure: chunks 1–4 mapped onto packets 1–4]

• ALF: app chooses packet size related to chunk size (equal, or a whole-numbered multiple, or vice versa)
– packet 2 lost: no unnecessary data in packet 1; use chunks 3 and 4 before the retransmitted chunk 2 arrives

• 1 ADU (Application Data Unit) = multiple chunks => ALF still more efficient!


Unordered delivery & multistreaming

• Decoupling of reliable and ordered delivery
– Unordered delivery: eliminate Head-Of-Line (HOL) blocking delay

[Figure: TCP receiver buffer holds chunks 2–4 while chunk 1 is missing – the app waits in vain!]

• Support for multiple data streams (per-stream ordered delivery)
– Stream sequence number (SSN) preserves order within streams
– no order preserved between streams

Stream mux = messages + identifier!


Multiple Data Streams
• An application may use multiple logical data streams
– e.g. pictures in a web browser
• Common solution: multiple TCP connections
– separate flow / congestion control, overhead (connection setup/teardown, ..)

[Figure: app streams 1 and 2 (chunks 1–4 each) multiplexed onto one TCP sender; at the receiver, stream 1's second chunk arrives last, so App 1 waits in vain although stream 2's data is already there]


Side note: stream multiplexing and HTTP
• HTTP/1.1 persistent connections: multiplexing as on the previous slide
– Benefit: allows TCP to increase its window further (most web flows terminate in slow start)
– Disadvantage: transport-layer HOL delay
– Worse yet: in HTTP/1.1, the client determines the sequence
– The client does not know which requests take long (e.g. database lookups, ..); can cause app-layer HOL delay!
• HTTP/2 multiplexes frames onto 1 TCP connection, can accept them out of order
– TCP can still cause RTT-timescale HOL blocking delay
– Google's solution: QUIC / UDP

• Note: multiple TCP connections are also more aggressive and less "fragile" than one (1 TCP: single loss = cwnd drop)... may need more aggressive CC to get a benefit with a single connection


QUIC

• Userspace transport over UDP
• Used in Google Chrome & Google servers (2017 estimate: 30% of Google's total egress traffic, 7% of global Internet traffic)
– Monolithic implementation taken to the IETF, getting modularized; ongoing work

• Designed for HTTP/2, and covering TLS
– 0-RTT crypto handshake: like TFO, but including security information on the first packet as well
– Not limited to the "content of a single TCP SYN packet"
– IETF version uses TLS 1.3


QUIC /2

• Everything authenticated, mostly encrypted
– Encryption avoids ossification
– The unencrypted part of the header is carefully designed

• Multi-streaming (avoids transport-layer HOL blocking delay)
– Every stream is a reliable bidirectional bytestream (but: no retransmission for cancelled streams)
– Implemented via stream frames, multiple per packet possible
– Connection- and stream-level credit-based flow control (similar to HTTP/2)

• Version negotiation

• Version negotiation


QUIC /3

• Avoid TCP's RTT calculation & loss recovery problems
– Avoid retransmission ambiguity: never re-use sequence numbers
– ACKs encode the delay between receiving and ACKing
– ACKs support up to 256 ACK blocks

• Congestion control: Cubic (but slightly more aggressive), soon probably BBR; the standard says (will say) Reno

• Mobility and NAT rebinding support: 64-bit connection ID

• Discovery: learn about QUIC via the "Alt-Svc" header in an HTTP response over TLS/TCP, then race QUIC + TCP, prefer QUIC by delaying TCP connections, cache errors ("Happy Eyeballs")


Conclusion


Transport: more interesting again!

• TCP recently changed a lot
– AQM + ECN: finally getting used??
– MPTCP
– Diversity of congestion controls in use: Reno, Cubic, BBR, MPTCP ...
• Non-TCP CC: delay-based-but-reasonably-competing in WebRTC, LBE (LEDBAT) in BitTorrent...
• SCTP used as always + in the WebRTC data channel
• QUIC in Chrome + Google servers, much industry interest

• Ongoing efforts to clean up this mess
– TAPS IETF WG, NEAT EC project, Apple's post-sockets code...


Thank you!


Backup slides

Limited Slow Start
TCP-NCR


Limited Slow Start

• Slow start problems:
– initial ssthresh = a constant, not related to the real network; this is especially severe when cwnd and ssthresh are very large
• Proposals to initially adjust ssthresh failed: must be quick and precise
– Assume: cwnd and ssthresh are large, and avail. bw. = current window + 1 packet?
• The next updates (cwnd++ for every ACK) will cause many packet drops

• Solution: Limited Slow Start
– cwnd <= max_ssthresh: normal operation; recommended max_ssthresh = 100 SMSS
– else K = int(cwnd / (0.5 * max_ssthresh)), cwnd += int(MSS/K)
– More conservative than Slow Start: for a while cwnd += MSS/2, then cwnd += MSS/3, etc.
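The Limited Slow-Start rule from the slide as code (segment-counting sketch of the RFC 3742 formula):

```python
# Sketch of Limited Slow Start: normal slow start up to max_ssthresh,
# then a progressively gentler per-ACK increase.

MAX_SSTHRESH = 100   # in segments (recommendation: 100 SMSS)

def lss_increase(cwnd):
    """Per-ACK increase in segments: 1 up to max_ssthresh, then
    1/K with K = int(cwnd / (0.5 * max_ssthresh))."""
    if cwnd <= MAX_SSTHRESH:
        return 1.0
    k = int(cwnd / (0.5 * MAX_SSTHRESH))
    return 1.0 / k

print(lss_increase(80))    # 1.0
print(lss_increase(120))   # 0.5  (K = 2)
print(lss_increase(250))   # 0.2  (K = 5)
```

The per-RTT growth is thus capped at about max_ssthresh/2 segments, instead of doubling an arbitrarily large window in one round trip.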


Non-Congestion Robustness (NCR)

• Assumption: 3 DupACKs clearly indicate loss
– Can be incorrect when packets are reordered

• Reordering is not rare
– And new mechanisms in the network could be devised if TCP were robust against reordering (e.g. consider splitting a flow onto multiple paths)

• Approach: increase the number of DupACKs N to approx. 1 cwnd
– Alternative proposal: making that number adaptive, based on scoreboard info

• Extended Limited Transmit; 2 variants
– Careful Limited Transmit: send 1 new packet for every other DupACK until N is reached (halve the sending rate, but send new data for a while)
– Aggressive Limited Transmit: send 1 new packet for every DupACK until N is reached (delay halving the sending rate)
– A full ACK ends the process