107
Transport Protocol Design: UDP, TCP Ali El-Mousa Based in part upon slides of Prof. Raj Jain (OSU), Srini Seshan (CMU), J. Kurose (U Mass), I.Stoica (UCB) 1

Transport Protocol Design: UDP, TCP

  • Upload
    ardith

  • View
    53

  • Download
    1

Embed Size (px)

DESCRIPTION

Transport Protocol Design: UDP, TCP. Ali El-Mousa Based in part upon slides of Prof. Raj Jain (OSU), Srini Seshan (CMU), J. Kurose (U Mass), I.Stoica (UCB). Overview. UDP: connectionless, end-to-end service UDP Servers TCP features, Header format Connection Establishment - PowerPoint PPT Presentation

Citation preview

Transport Protocol Design: UDP, TCPAli El-Mousa

Based in part upon slides of Prof. Raj Jain (OSU), Srini Seshan (CMU), J. Kurose (U Mass), I.Stoica (UCB)1UDP: connectionless, end-to-end serviceUDP ServersTCP features, Header formatConnection EstablishmentConnection Termination TCP Server DesignRef: Chap 11, 17,18; RFC 793, 1323Overview2Transport ProtocolsProtocol implemented entirely at the endsFate-sharingCompleteness/correctness of function implementations

UDP provides just integrity and demuxTCP addsConnection-orientedReliableOrderedPoint-to-pointByte-streamFull duplexFlow and congestion controlled3UDP: User Datagram Protocol [RFC 768]Minimal Transport Service:Best effort service, UDP segments may be:LostDelivered out of order to appConnectionless:No handshaking between UDP sender, receiverEach UDP segment handled independently of othersWhy is there a UDP?No connection establishment (which can add delay)Simple: no connection state at sender, receiverSmall headerNo congestion control: UDP can blast away as fast as desired: dubious!

4Multiplexing / demultiplexingRecall: segment - unit of data exchanged between transport layer entities aka TPDU: transport protocol data unitapplicationtransportnetwork

MP2applicationtransportnetwork

receiverHtHnDemultiplexing: delivering received segments to correct app layer processessegmentsegmentMapplicationtransportnetwork

P1MMMP3P4segmentheaderapplication-layerdata5Multiplexing / demultiplexingmultiplexing/demultiplexing:based on sender, receiver port numbers, IP addressessource, dest port #s in each segmentrecall: well-known port numbers for specific applicationsgathering data from multiple app processes, enveloping data with header (later used for demultiplexing)source port #dest port #32 bitsapplicationdata (message)other header fieldsTCP/UDP segment formatMultiplexing:6UDP, cont.Often used for streaming multimedia appsLoss tolerantRate sensitiveOther UDP uses (why?):DNSSNMPReliable transfer over UDP: add reliability at application layerApplication-specific error recover!Source port #Dest port #32 bitsApplicationdata (message)UDP segment formatLengthChecksumLength, inbytes of UDPsegment,includingheader7UDP ChecksumSender:Treat segment contents as sequence of 16-bit integersChecksum: addition (1s complement sum) of segment contentsSender puts checksum value into UDP checksum fieldReceiver:Compute checksum of received segmentCheck if computed checksum equals checksum field value:NO - error detectedYES - no error detected. But maybe errors nonetheless? Goal: detect errors (e.g., flipped bits) in transmitted segment. Note: IP only has a header checksum.8Introduction to TCPCommunication abstraction:ReliableOrderedPoint-to-pointByte-streamFull duplexFlow and congestion controlledProtocol implemented entirely at the endsFate sharing9Evolution of TCP19751980198519901982TCP & IPRFC 793 & 7911974TCP described byVint Cerf and Bob KahnIn IEEE Trans Comm1983BSD Unix 4.2supports TCP/IP1984Nagels algorithmto reduce overheadof small packets;predicts congestion collapse1987Karns algorithmto better estimate round-trip time1986Congestion collapseobserved1988Van Jacobsons algorithmscongestion avoidance and congestion control(most implemented in 4.3BSD Tahoe)19904.3BSD Renofast retransmitdelayed ACKs1975Three-way handshakeRaymond TomlinsonIn SIGCOMM 7510TCP Through the 1990s1993199419961994ECN(Floyd)Explicit CongestionNotification1993TCP Vegas (Brakmo et al)real congestion avoidance1994T/TCP(Braden)TransactionTCP1996SACK TCP(Floyd et al)Selective Acknowledgement1996HoeImproving TCP startup1996FACK TCP(Mathis et al)extension to SACK11TCP HeaderSource portDestination portSequence numberAcknowledgementAdvertised windowHdrLenFlags0ChecksumUrgent pointerOptions (variable)DataFlags:SYNFINRESETPUSHURGACK12Reliability ModelsReliability => requires redundancy to recover from uncertain loss or other failure modes.

Two types of redundancy: Spatial redundancy: independent backup copies Forward error correction (FEC) codes Problem: requires huge overhead, since the FEC is also part of the packet(s) it cannot recover from erasure of all packetsTemporal redundancy: retransmit if packets lost/errorLazy: trades off response time for reliability Design of status reports and retransmission optimization important

14Types of errors and effectsForward channel bit-errors (garbled packets)Forward channel packet-errors (lost packets)Reverse channel bit-errors (garbled status reports)Reverse channel bit-errors (lost status reports)

Protocol-induced effects: Duplicate packetsDuplicate status reportsOut-of-order packetsOut-of-order status reportsOut-of-range packets/status reports (in window-based transmissions)16Mechanisms Mechanisms:Checksum in pkts: detects pkt corruptionACK: packet correctly receivedNAK: packet incorrectly received[aka: stop-and-wait Automatic Repeat reQuest (ARQ) protocols]

Provides reliable transmission over: An error-free forward and reverse channel A forward channel which has bit-errors; reverse: ok

Cannot handle reverse-channel bit-errors; or packet-losses in either direction.17More mechanisms Mechanisms:Checksum: detects corruption in pkts & acksACK: packet correctly receivedNAK: packet incorrectly receivedSequence number: identifies packet or ack1-bit sequence number used only in forward channel [aka: alternating-bit protocols]

Provides reliable transmission over:An error-free channelA forward & reverse channel with bit-errorsDetects duplicates of packets/acks/naks

Still needs NAKs, and cannot recover from packet errors18More Mechanisms Mechanisms:Checksum: detects corruption in pkts & acksACK: packet correctly receivedDuplicate ACK: packet incorrectly receivedSequence number: identifies packet or ack1-bit sequence number used both in forward & reverse channelProvides reliable transmission over:An error-free channelA forward & reverse channel with bit-errorsDetects duplicates of packets/acksNAKs eliminated

Packet errors in either direction not handled19Reliability Mechanisms Mechanisms:Checksum: detects corruption in pkts & acksACK: packet correctly receivedDuplicate ACK: packet incorrectly receivedSequence number: identifies packet or ack1-bit sequence number used both in forward & reverse channelTimeout only at senderProvides reliable transmission over:An error-free channelA forward & reverse channel with bit-errorsDetects duplicates of packets/acksNAKs eliminatedA forward & reverse channel with packet-errors (loss)20

Example: Three-Way HandshakeTCP connection-establishment: 3-way-handshake necessary and sufficient for unambiguous setup/teardown even under conditions of loss, duplication, and delay21TCP Connection Setup: FSMCLOSEDSYNSENTSYNRCVDESTABLISTENactive OPENcreate TCBSnd SYN create TCBpassive OPENdelete TCBCLOSEdelete TCBCLOSEsnd SYNSENDsnd SYN ACKrcv SYNSend FINCLOSErcv ACK of SYNSnd ACKRcv SYN, ACKrcv SYNsnd ACKhttp://www.tcpipguide.com/free/t_TCPOperationalOverviewandtheTCPFiniteStateMachineF-2.htm22More Connection EstablishmentSocket: BSD term to denote an IP address + a port number. A connection is fully specified by a socket pair i.e. the source IP address, source port, destination IP address, destination port. Initial Sequence Number (ISN): counter maintained in OS. BSD increments it by 64000 every 500ms or new connection setup => time to wrap around < 9.5 hours.23TCP Connection Tear-downSenderReceiverFINFIN-ACKFINFIN-ACKData writeData ack24TCP Connection Tear-down: FSMCLOSINGCLOSEWAITFINWAIT-1ESTABTIME WAITsnd FINCLOSEsend FINCLOSErcv ACK of FINLAST-ACKCLOSEDFIN WAIT-2snd ACKrcv FINdelete TCBTimeout=2mslsend FINCLOSEsend ACKrcv FINsnd ACKrcv FINrcv ACK of FINsnd ACKrcv FIN+ACKhttp://www.tcpipguide.com/free/t_TCPOperationalOverviewandtheTCPFiniteStateMachineF-2.htm25Time Wait IssuesWeb servers not clients close connection firstEstablished Fin-Waits Time-Wait ClosedWhy would this be a problem?Time-Wait state lasts for 2 * MSLMSL should be 120 seconds (is often 60s)Servers often have order of magnitude more connections in Time-Wait

26Stop-and-Wait EfficiencyDataAckAckDatatframetprop =tproptframe=Distance/Speed of SignalFrame size /Bit rate=Distance Bit rateFrame size Speed of Signal=12 + 1U=2tprop+tframetframeULight in vacuum = 300 m/sLight in fiber = 200 m/sElectricity = 250 m/sNo loss or bit-errors!27Sliding Window Protocols: EfficiencyDataAcktframetpropU=Ntframe2tprop+tframe=N2+1 1 if N>2+1Note: no loss or bit-errors!29Go-Back-NSender:k-bit seq # in pkt headerAllows upto N = 2k 1 packets in-flight, unackedWindow: limit on # of consecutive unacked pkts In GBN, window = N

30Go-Back-NACK(n): ACKs all pkts up to, including seq # n - cumulative ACKSender may receive duplicate ACKs (see receiver)Robust to losses on the reverse channelCan pinpoint the first packet lost, but cannot identify blocks of lost packets in windowOne timer for oldest-in-flight pktTimeout => retransmit pkt base and all higher seq # pkts in window31Reliability Mechanisms: SummaryChecksum: detects corruption in pkts & acksACK: packet correctly receivedDuplicate ACK: packet incorrectly receivedCumulative ACK: acks all pkts upto & incl. seq # (GBN)Selective ACK: acks pkt n only (selective repeat)Sequence number: identifies packet or ack1-bit sequence number used both in forward & reverse channelsk-bit sequence number in both forward & reverse channels. Let N = 2k 1 = sequence number space size33Reliability Mechanisms: SummaryTimeout only at sender. One timer for entire window (go-back-N)One timer per pkt (selective repeat)Window: sender and receiver side. Limits on what can be sent (or expected to be received).Window size (W) upto N 1 (Go-back-N) Window size (W) upto N/2 (Selective Repeat) Buffering Only at sender (Go-back-N)Out-of-order buffering at sender & receiver (Selective Repeat)

34Reliability capabilities: SummaryProvides reliable transmission over:An error-free channelA forward & reverse channel with bit-errorsDetects duplicates of packets/acksNAKs eliminatedA forward & reverse channel with packet-errors (loss)

Pipelining efficiency: Go-back-N: Entire outstanding window retransmitted if pkt loss/errorSelective Repeat: only lost packets retransmitted performance penalty if ACKs lost (because acks non-cumulative) & more complexity35Whats Different in TCP From Link Layers?Logical link vs. physical linkMust establish connectionVariable RTTMay vary within a connection => Timeout variableReorderingHow long can packets livemax segment lifetime (MSL)Cant expect endpoints to exactly match link rateBuffer space availability, flow controlTransmission rateDont directly know transmission rate

36Sequence Number SpaceEach byte in byte stream is numbered.32 bit valueWraps aroundInitial values selected at start up timeTCP breaks up the byte stream in packets.Packet size is limited to the Maximum Segment SizeEach packet has a sequence number.Indicates where it fits in the byte streampacket 8packet 9packet 101345014950160501755037MSSMaximum Segment Size (MSS)Largest chunk sent between TCPs. Default = 536 bytes. Not negotiated. Announced in connection establishment. Different MSS possible for forward/reverse paths.Does not include TCP headerWhat all does this effect?EfficiencyCongestion controlRetransmissionPath MTU discoveryWhy should MTU match MSS?38TCP Window Flow Control: Send SideSent but not ackedNot yet sentwindowNext to be sentSent and acked39Acked but notdelivered to userNot yetackedReceive bufferwindowWindow Flow Control: Receive Side41Silly Window SyndromeProblem: (Clark, 1982)If receiver advertises small increases in the receive window then the sender may waste time sending lots of small packetsSolutionReceiver must not advertise small window increases Increase window by min(MSS,RecvBuffer/2)42Nagels Algorithm & Delayed AcksSmall packet problem:Dont want to send a 41 byte packet for each keystrokeHow long to wait for more data? Solution: Nagels algorithmAllow only one outstanding small (not full sized) segment that has not yet been acknowledged

Batching acknowledgements: Delay-ack timer: piggyback ack on reverse traffic if available200 ms timer will trigger ack if no reverse traffic available43Timeout and RTT EstimationProblem: Unlike a physical link, the RTT of a logical link can vary, quite substantiallyHow long should timeout be ?Too long => underutilizationToo short => wasteful retransmissions

Solution: adaptive timeout: based on a good estimate of maximum current value of RTT44How to estimate max RTT?RTT = prop + queuing delayQueuing delay highly variableSo, different samples of RTTs will give different random values of queuing delay

Chebyshevs Theorem: MaxRTT = Avg RTT + k*DeviationError probability is less than 1/(k**2)Result true for ANY distribution of samples45Timer GranularityMany TCP implementations set RTO in multiples of 200,500,1000msWhy?Avoid spurious timeouts RTTs can vary quickly due to cross trafficDelayed-ack timer can delay valid acks by upto 200msMake timers interrupts efficient

What happens for the first couple of packets?Pick a very conservative value (seconds)Can lead to stall if early packet lost47Retransmission AmbiguityABACKSampleRTTOriginal transmissionretransmissionRTOABOriginal transmissionretransmissionSampleRTTACKRTOX48Karns RTT EstimatorAccounts for retransmission ambiguityIf a segment has been retransmitted:Dont update RTT estimators during retransmission. Timer backoff: If timeout, RTO = 2*RTO {exponential backoff}Keep backed off time-out for next packetReuse RTT estimate only after one successful packet transmission49Timestamp ExtensionUsed to improve timeout mechanism by more accurate measurement of RTTWhen sending a packet, insert current timestamp into option4 bytes for seconds, 4 bytes for microsecondsReceiver echoes timestamp in ACKActually will echo whatever is in timestampRemoves retransmission ambiguity!Can get RTT sample on any packet50Recap: Stability of a Multiplexed System

Average Input Rate > Average Output Rate => system is unstable! How to ensure stability ?Reserve enough capacity so that demand is less than reserved capacity Dynamically detect overload and adapt either the demand or capacity to resolve overload51Congestion: Tragedy of CommonsDifferent sources compete for common or shared resources inside network. Sources are unaware of current state of resourceSources are unaware of each otherSource has self-interest. Assumes that increasing rate by N% will lead to N% increase in throughput!Conflicts with collective interests: if all sources do this to drive the system to overload, throughput gain is NEGATIVE, and worsens rapidly with incremental overload => congestion collapse!!Need enlightened self-interest!10 Mbps100 Mbps1.5 Mbps53Congestion: A Close-up View knee point after which throughput increases very slowlydelay increases fastcliff point after whichthroughput starts to decrease very fast to zero (congestion collapse)delay approaches infinity

Note (in an M/M/1 queue)delay = 1/(1 utilization)LoadLoadThroughputDelaykneecliffcongestioncollapsepacketloss54Congestion Control vs. Congestion AvoidanceCongestion control goalstay left of cliff Congestion avoidance goalstay left of kneeRight of cliff: Congestion collapseLoadThroughputkneecliffcongestioncollapse55Congestion CollapseDefinition: Increase in network load results in decrease of useful work doneMany possible causesSpurious retransmissions of packets still in flightUndelivered packetsPackets consume resources and are dropped elsewhere in networkFragmentsMismatch of transmission and retransmission unitsControl trafficLarge percentage of traffic is for controlStale or unwanted packetsPackets that are delayed on long queues56iiIf information about i , and is known in a central location where control of i or can be effected with zero time delays, the congestion problem is solved!Capacity () cannot be provisioned very fast => demand must be managedPerfect callback: Admit packets into the network from the user only when the network has capacity (bandwidth and buffers) to get the packet across. Solution Directions. 1nCapacityDemandProblem: demand outstrips available capacity57IssuesIf information about i , and is known in a central location where control of i or can be effected with zero time delays, the congestion problem is solved!Information/knowledge: Only incomplete information about the congestion situation is known (eg: loss indications, single bit, explicit rate field, measure of backlog etc)Central vs distributed:a distributed solution is requiredDemand vs capacity control: usually only the demand is controllable on small time-scales. Capacity provisioning may be possible on larger time-scales.Measurement/control points: The congestion point, congestion detection/measurement point, and the control points may be different.Time-delays: Between the various points, there may be time-varying and heterogeneous time-delays58Static solutionsQ: Will the congestion problem be solved when:a) Memory becomes cheap (infinite memory)?No bufferToo lateAll links 19.2 kb/sReplace with 1 Mb/sSSSSSSSSFile Transfer Time = 7 hoursFile Transfer time = 5 mins b) Links become cheap (high speed links)?ABSCDScenario: All links 1 Gb/s. A & B send to C => high-speed congestion!! (lose more packets faster!)Static solutions (Continued)c) Processors become cheap (fast routers & switches)60Two models of congestion control1. End-to-end model:End-systems is ultimately the source of demandEnd-system must robustly estimate the timing and degree of congestion and reduce its demand appropriatelyMust trust other end hosts to do right thingIntermediate nodes relied upon to send timely and appropriate penalty indications (eg: packet loss rate) during congestion Enhanced routers could send more accurate congestion signals, and help end-system avoid other side-effects in the control process (eg: early packet marks instead of late packet drops) Key: trust and complexity resides at end-systemsIssue: What about misbehaving flows?61Two models of congestion control2. Network-based model:A) All end-systems cannot be trusted and/or B) The network node has more control over isolation/scheduling of flowsAssumes network nodes can be trusted. Each network node implements isolation and fairness mechanisms (eg: scheduling, buffer management)A flow which is misbehaving hurts only itselfProblems: Partial soln: if flows dont back off, each flow has congestion collapse, i.e. lousy throughput during overloadSignificant complexity in network nodesIf some routers do not support this complexity, congestion still exists Classic justification of the end-to-end principle62Goals of Congestion ControlTo guarantee stable operation of packet networksSub-goal: avoid congestion collapse

To keep networks working in an efficient status Eg: high throughput, low loss, low delay, and high utilization

To provide fair allocations of network bandwidth among competing flows in steady stateFor some value of fair 63What is stability ?Equilibrium point(s) of a dynamic system For packet networksEach user will get an allocation of bandwidthChanges of network or user parameters will move the equilibrium from one point, (hopefully) after a brief transient period, to a new oneSystem should not remain indefinitely away from equilibrium if there are no more external perturbations

Example of instability: unbounded queue growth64What is fairness ?one of the most over-defined (and probably over-rated) conceptsfairness indexmax-minproportionalinfinite number of notions! Fairness for best-effort service, roughly means that services are provided to selfish, competing users in a predictable way65Eg: max-min fairness

862442f = 4: min(8, 4) = 4 min(6, 4) = 4 min(2, 4) = 2 10if link not congested, then

otherwise, if link congested

x1x2x366AllocationsFlow Control Optimization ModelGiven a set S of flows, and a set L of linksEach flow s has utility Us(xs) , xs is its sending rateEach link l has capacity clModeled as optimization (Eg: Kelly98, Low99)

67

where Sl = { s | flow s passes the link l }What is Fairness ?Achieves (w,) fairness if for any other feasible allocation [mo00]:

where ws is the weight for flow sweighted maximum throughput fairness is (w,0)weighted proportional fairness is (w,1)weighted minimum potential delay fairness is (w,2)weighted max-min fairness is (w,)Weight could be driven by economic considerations, or scheme dependencies on factors like RTT, loss rate etc

68

What is fairness ? (contd)fairness (-) axis69012 = 0 : maximum throughput fairness = 1 : proportional fairness = 2 : minimum delay fairness = : max-min fairnessProportional vs Max-min Fairnessproportional fairnessthe more a flow consumes critical network resources, the less allocationnetwork as a white boxnetwork operators viewf0 = 0.1, f1~9 = 0.97070f0f1f2f9r1r2r3r10Ci = 1max-min fairnessevery flow has the same right to all network resourcesnetwork as a black boxnetwork users viewf0 = f1~9 = 0.5EquilibriumOperate at equilibrium near the knee pointHow to maintain equilibrium?Packet-conservation: Dont put a packet into network until another packet leaves. Use ACK: send a new packet only after you receive and ACK. Why?A.k.a Self-clocking or Ack-clockingIn steady state, keep # packets in network constantProblem: how do you know you are at the knee?Network capacity or competing demand may change: Need to probe for knee by increasing demandNeed to reduce demand overshoot detectedEnd-result: oscillate around knee Violate packet-conservation each time you probe by the degree of demand increase71Self-clockingPrPbArAbReceiverSenderAs Implications of ack-clocking: More batching of acks => bursty traffic Less batching leads to a large fraction of Internet traffic being just acks (overhead)72Basic Control ModelLets assume window-based operationReduce window when congestion is perceivedHow is congestion signaled? Either mark or drop packetsWhen is a router congested? Drop tail queues when queue is full Average queue length at some thresholdIncrease window otherwiseProbe for available bandwidth how?73Simple linear controlMany different possibilities for reaction to congestion and methods for probingExamine simple linear controlsWindow(t + 1) = a + b Window(t)Different ai/bi for increase and ad/bd for decreaseSupports various reaction to signalsIncrease/decrease additivelyIncreased/decrease multiplicativelyWhich of the four combinations is optimal?74Phase plotsSimple way to visualize behavior of competing flows over time

Caveat: assumes 2 flows, synchronized feedback, equal RTT, discrete rounds of operation Efficiency LineFairness LineUser 1s Allocation x1User 2s Allocation x2Optimal pointOverloadUnderutilization75Multiplicative Increase/DecreaseBoth X1 and X2 increase by the same factor over timeFairness unaffected (constant), but load increases (MI) or decreases (MD)T0T1Efficiency LineFairness LineUser 1s Allocation x1User 2s Allocation x277Additive Increase/Multiplicative Decrease (AIMD) PolicyAssumption: decrease policy must (at minimum) reverse the load increase over-and-above efficiency lineImplication: decrease factor should be conservatively set to account for any congestion detection lags etcx0x1x2Efficiency LineFairness LineUser 1s Allocation x1User 2s Allocation x278TCP Congestion ControlMaintains three variables:cwnd congestion windowrcv_win receiver advertised window ssthresh threshold size (used to update cwnd) Rough estimate of knee point

For sending use: win = min(rcv_win, cwnd)79TCP: Slow StartGoal: initialize system and discover congestion quicklyHow? Quickly increase cwnd until network congested get a rough estimate of the optimal cwndHow do we know when network is congested?packet loss (TCP)over the cliff here congestion control congestion notification (eg: DEC Bit, ECN)over knee; before the cliffcongestion avoidance Implications of using loss as congestion indicatorLate congestion detection if the buffer sizes largerHigher speed links or large buffers => larger windows => higher probability of burst lossInteractions with retransmission algorithm and timeouts80TCP: Slow StartWhenever starting traffic on a new connection, or whenever increasing traffic after congestion was experienced: Set cwnd =1 Each time a segment is acknowledged increment cwnd by one (cwnd++).

Does Slow Start increment slowly? Not really. In fact, the increase of cwnd is exponential!!Window increases to W in RTT * log2(W)81Slow Start ExampleThe congestion window size grows very rapidly

TCP slows down the increase of cwnd when cwnd >= ssthresh ACK for segment 1segment 1cwnd = 1cwnd = 2segment 2segment 3ACK for segments 2 + 3cwnd = 4segment 4segment 5segment 6segment 7ACK for segments 4+5+6+7cwnd = 882Slow Start Example1One RTTOne pkt time0R21R342R56783R9101112131415123456783Slow Start Sequence PlotTimeSequence No...Window doubles every round84Congestion AvoidanceGoal: maintain operating point at the left of the cliff:How?additive increase: starting from the rough estimate (ssthresh), slowly increase cwnd to probe for additional available bandwidthmultiplicative decrease: cut congestion window size aggressively if a loss is detected.85Congestion AvoidanceSlow down Slow Start

If cwnd > ssthresh then each time a segment is acknowledged increment cwnd by 1/cwnd i.e. (cwnd += 1/cwnd).

So cwnd is increased by one only if all segments have been acknowledged.(more about ssthresh latter)86Congestion Avoidance Sequence PlotTimeSequence NoWindow growsby 1 every round87Slow Start/Congestion Avoidance Eg.Assume that ssthresh = 8

Roundtrip timesCwnd (in segments)ssthresh88Putting Everything Together:TCP Pseudo-codeInitially:cwnd = 1;ssthresh = infinite;New ack received:if (cwnd < ssthresh) /* Slow Start*/ cwnd = cwnd + 1;else /* Congestion Avoidance */ cwnd = cwnd + 1/cwnd;Timeout: (loss detection)/* Multiplicative decrease */ssthresh = win/2;cwnd = 1;while (next < unack + win)transmit next packet;

where win = min(cwnd, flow_win);unacknextwinseq #89The big pictureTimecwndTimeoutSlow StartCongestionAvoidance90Packet Loss Detection: Timeout AvoidanceWait for Retransmission Time Out (RTO)Whats the problem with this? Because RTO is a performance killerIn BSD TCP implementation, RTO is usually more than 1 secondthe granularity of RTT estimate is 500 msretransmission timeout is at least two times of RTT

Solution: Dont wait for RTO to expire Use alternate mechanism for loss detectionFall back to RTO only if these alternate mechanisms fail.91Fast RetransmitResend a segment after 3 duplicate ACKsRecall: a duplicate ACK means that an out-of sequence segment was receivedNotes: duplicate ACKs due packet reordering!if window is small dont get duplicate ACKs!ACK 1segment 1cwnd = 1cwnd = 2segment 2segment 3ACK 3cwnd = 4segment 4segment 5segment 6segment 7ACK 13 duplicateACKsACK 4ACK 4ACK 492Fast Recovery (Simplified)After a fast-retransmit set cwnd to ssthresh/2i.e., dont reset cwnd to 1But when RTO expires still do cwnd = 1

Fast Retransmit and Fast Recovery implemented by TCP Reno; most widely used version of TCP today 93Fast Retransmit and Fast RecoveryRetransmit after 3 duplicated acksprevent expensive timeoutsNo need to slow start againAt steady state, cwnd oscillates around the optimal window size.TimecwndSlow StartCongestionAvoidance94Fast RetransmitTimeSequence NoDuplicate AcksRetransmissionX95Multiple LossesTimeSequence NoDuplicate AcksRetransmissionXXXXNow what?96TimeSequence NoXXXXTCP Versions: Tahoe97TCP Versions: RenoTimeSequence NoXXXXNow what? - timeout98NewRenoThe ack that arrives after retransmission (partial ack) should indicate that a second loss occurredWhen does NewReno timeout?When there are fewer than three dupacks for first lossWhen partial ack is lostHow fast does it recover losses?One per RTT99NewRenoTimeSequence NoXXXXNow what? partial ackrecovery100SACKBasic problem is that cumulative acks only provide little informationAlt: Selective Ack for just the packet receivedWhat if selective acks are lost? carry cumulative ack also!

Implementation: Bitmask of packets received Selective acknowledgement (SACK)Only provided as an optimization for retransmissionFall back to cumulative acks to guarantee correctness and window updates

101SACKTimeSequence NoXXXXNow what? sendretransmissions as soonas detected102Asymmetric BehaviorThree important characteristics of a path LossDelayBandwidthForward and reverse paths are often independent even when they traverse the same set of routersMany link types are unidirectional and are used in pairs to create bi-directional linkAIBInternet(no congestion, bandwidth > 6Mbps)6Mbps32kbps103Asymetric LossLossInformation in acks is very redundantLow levels of ack loss will not create problemsTCP relies on ack clocking will burst out packets when cumulative ack covers large amount of dataBurstiness will in turn cause queue overflow/loss

Max burst size for TCP and/or simple rate pacingCritical also during restart after idle104Ack CompressionWhat if acks encounter queuing delay?Smooth ack clocking is destroyedBasic assumption that acks are spaced due to packets traversing forward bottleneck is violatedSender receives a burst of acks at the same time and sends out corresponding burst of dataHas been observed and does lead to slightly higher loss rate in subsequent window105Bandwidth AsymmetryCould congestion on the reverse path ever limit the throughput on the forward link?Lets assume MSS = 1500bytes and delayed acksFor every 3000 bytes of data need 40 bytes of acks75:1 ratio of bandwidth can be supportedModem uplink (28.8Kbps) can support 2Mbps downlinkMany cable and satellite links are worse than thisSolutions: Header compression, link-level supportAIBInternet(no congestion, bandwidth > 6Mbps)6Mbps32kbps106TCP Congestion Control SummarySliding window limited by receiver window.Dynamic windows: slow start (exponential rise), congestion avoidance (additive rise), multiplicative decrease.Ack clockingAdaptive timeout: need mean RTT & deviationTimer backoff and Karns algo during retransmission

Go-back-N or Selective retransmissionCumulative and Selective acknowledgementsTimeout avoidance: Fast Retransmit

107