Congestion Control
Internet Traffic Engineering
Measurement: for a reality check
Experiment: for implementation issues
Analysis:
  brings fundamental understanding of systems
  may lose important facts because of simplification
Simulation:
  complementary to analysis: correctness, exploring complicated models
  may share a similar model with analysis
What is congestion?
The aggregate demand for bandwidth exceeds the available capacity of a link.
What will occur? Performance degradation:
• multiple packet losses
• low link utilization (low throughput)
• high queueing delay
• congestion collapse
What is congestion? - 2
Congestion Control
Open-loop control
  mainly used in circuit-switched networks (GMPLS)
Closed-loop control
  mainly used in packet-switched networks
  uses feedback information: global & local
Implicit feedback control
  end-to-end congestion control
  examples: TCP Tahoe, TCP Reno, TCP Vegas, etc.
Explicit feedback control
  network-assisted congestion control
  examples: IBM SNA, DECbit, ATM ABR, ICMP source quench, RED, ECN
Congestion Control and Avoidance
Two approaches to handling congestion
Congestion control (reactive)
• acts after the network is overloaded
Congestion avoidance (proactive)
• acts before the network becomes overloaded
Open-loop control --- congestion avoidance
the source establishes a traffic descriptor with the network describing its needs
the network typically reserves resources and performs enforcement:
  admission control for new connections
  shaping or policing at the edges for data
challenges: choosing the traffic descriptor, choosing the scheduling discipline at routers, performing admission control
Implicit vs. Explicit feedback
Implicit feedback congestion control
The network drops packets when congestion occurs
The source infers congestion implicitly
• time-outs, duplicate ACKs, etc.
Example: end-to-end TCP congestion control
Simple to implement but inaccurate
• implemented only at the transport layer (e.g., TCP)
Implicit vs. Explicit feedback - 2
Explicit feedback congestion control
A network component (e.g., a router) provides congestion indication explicitly to sources
• using packet marking, or RM cells (in ATM ABR control)
Examples: DECbit, ECN, ATM ABR CC, etc.
Provides more accurate information to sources
But is more complicated to implement
• need to change both source and network algorithms
• need cooperation between sources and network components
TCP Congestion Control
Uses end-to-end congestion control
Uses implicit feedback
• e.g., time-outs, triple duplicate ACKs, etc.
Uses window-based flow control
• cwnd = min (pipe size, rwnd)
• self-clocking (ACKs pace transmission)
• slow start and congestion avoidance
Examples:
• TCP Tahoe, TCP Reno, TCP Vegas, etc.
Congestion
Routers receive packets at a rate faster than they can process them
Newly arriving packets are dropped: the network is congested
If a packet is lost, the source retransmits; all sources do the same, causing even more congestion (congestion collapse!)
Solution: slow down the sources!
How do we know when to slow down? By how much?
Congestion
knee – point after which:
  throughput increases slowly
  delay increases fast
cliff – point after which:
  throughput starts to decrease fast to zero
  delay approaches infinity
[Figure: throughput and delay vs. load — congestion avoidance operates around the knee; congestion control operates around the cliff, beyond which packet loss and congestion collapse set in.]
Goals
The sender operates near the knee point
The source should not put a new packet into the network until another packet leaves
How? Use ACKs! I.e., send a new packet only after receiving an ACK (self-clocking)
This keeps the number of packets in the network constant
Self-clocking
[Figure: self-clocking pipe diagram (sender, bottleneck, receiver) — the spacing of returning ACKs mirrors the packet spacing at the bottleneck, so ACKs pace the sender's transmissions.]
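The self-clocking idea above can be sketched as a toy simulation. This is an illustrative sketch, not real TCP: the function name and the FIFO model of the network are assumptions; the point is only that "send one new packet per ACK" keeps the number of packets in flight constant at the window size.

```python
# Minimal sketch of self-clocking: a new packet enters the network only
# when an ACK leaves it, so packets in flight never exceed the window.
from collections import deque

def self_clocked_send(total_packets, window):
    in_flight = deque()
    sent = 0
    max_in_flight = 0
    # Initially, fill the window.
    while sent < min(window, total_packets):
        in_flight.append(sent)
        sent += 1
    acked = 0
    while in_flight:
        max_in_flight = max(max_in_flight, len(in_flight))
        in_flight.popleft()           # an ACK arrives for the oldest packet
        acked += 1
        if sent < total_packets:      # the ACK clocks out exactly one new packet
            in_flight.append(sent)
            sent += 1
    return acked, max_in_flight

print(self_clocked_send(20, 4))       # all 20 packets acked, never more than 4 in flight
```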
TCP Congestion Control
The TCP sender maintains three variables:
  cwnd – congestion window
  rcv_win – receiver-advertised window
  ssthresh – slow start threshold (used to update cwnd; intuitively, ssthresh is a rough estimate of the knee point)
send_win = min (rcv_win, cwnd)
TCP Tahoe implements:
  slow start
  congestion avoidance
  the fast retransmit algorithm
Slow Start (Simplified)
(initially) cwnd = 1*Max Segment Size (MSS)
each time an ACK is received for a segment: cwnd += 1*MSS (exponential growth of cwnd per RTT!)
if loss (i.e., timeout): cwnd = 1*MSS again
Congestion Avoidance (Simplified)
for each ACK received: cwnd += (MSS*MSS / cwnd)
  an approximation of increasing cwnd by 1*MSS per RTT (additive increase)
if loss (i.e., timeout): cut the window in half (multiplicative decrease)
Slow Start & Congestion Avoidance
ssthresh
• initially:
  cwnd = 1*MSS, ssthresh = very high
• if a new ACK arrives:
  - if cwnd < ssthresh, update cwnd according to slow start
  - if cwnd >= ssthresh, update cwnd according to congestion avoidance
• if timeout (i.e., loss):
  - ssthresh = send_win/2
  - cwnd = 1*MSS
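The rules above can be sketched in a few lines. This is a simplified model, not a full TCP implementation: the function names are illustrative, cwnd is tracked in units of MSS, and rounding stands in for segment granularity. It reproduces the per-RTT cwnd sequence 1, 2, 4, 8, 9, 10, 11 used in the worked example on the next slide (initial ssthresh = 8*MSS).

```python
# Sketch of the Tahoe cwnd update rules from the slides (units of MSS).
MSS = 1

def on_ack(cwnd, ssthresh):
    """Per-ACK update: slow start below ssthresh, else congestion avoidance."""
    if cwnd < ssthresh:
        return cwnd + MSS                  # doubles cwnd over one RTT
    return cwnd + MSS * MSS / cwnd         # ~ +1 MSS per RTT (additive increase)

def on_timeout(send_win):
    """Timeout: halve the threshold, restart from one segment."""
    return send_win // 2, 1 * MSS          # (new ssthresh, new cwnd)

# Per-RTT trajectory with initial ssthresh = 8*MSS: slow start gives
# 1, 2, 4, 8; congestion avoidance then gives 9, 10, 11.
cwnd, ssthresh = 1, 8
trace = []
for _ in range(7):
    trace.append(round(cwnd))
    for _ in range(round(cwnd)):           # one ACK per segment in the window
        cwnd = on_ack(cwnd, ssthresh)
print(trace)                               # [1, 2, 4, 8, 9, 10, 11]
```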
[Figure: cwnd vs. time — slow start (green) grows exponentially up to the initial ssthresh; congestion avoidance (blue) then grows linearly; on timeout (loss), cwnd drops back to 1*MSS.]

Example: Slow Start/Congestion Avoidance
assume (initial) ssthresh = 8*MSS
  cwnd = 1 → 2 → 4 → 8 (slow start: cwnd doubles each RTT as the ACKs return)
  eight TCP-PDUs, eight ACKs → cwnd = 9 (congestion avoidance begins)
  nine TCP-PDUs, nine ACKs → cwnd = 10
  ten TCP-PDUs, ten ACKs → cwnd = 11
[Figure: congestion window size (in MSS) vs. transmission number 1–7: 1, 2, 4, 8, 9, 10, 11, with ssthresh marked at 8.]
Fast Retransmit
The sender receives 3 dupACKs
The sender infers that the segment is lost
The sender doesn't wait for a timeout
The sender re-sends the segment immediately!
[Figure: fast-retransmit timeline — segments 1–7 are sent as cwnd grows 1, 2, 4; segment 4 is lost (X); the receiver keeps repeating ACK 3; after 3 duplicate ACK 3s, the sender fast-retransmits segment 4 without waiting for a timeout.]
TCP Versions: Tahoe
after a fast retransmit:
  ssthresh = send_win/2
  cwnd = 1*MSS
i.e., the sender goes back to slow start!
TCP Reno implements:
  slow start
  congestion avoidance
  the fast retransmit algorithm & fast recovery
Fast Recovery
Intuition: the receipt of dupACKs tells the sender that the receiver is still getting new segments, i.e., data is still flowing between sender and receiver. So why should the sender go back to slow start after a fast retransmit?
[Figure: cwnd vs. time for Reno — slow start up to the initial ssthresh, then congestion avoidance; fast retransmits "inflate" cwnd with dupACKs and "deflate" it on a new ACK; a timeout still drops cwnd back to slow start.]
The sender does the following after receiving 3 dupACKs:
1. sets ssthresh = send_win/2
2. retransmits the lost segment
3. sets cwnd = ssthresh + 3*MSS
4. for each dupACK received: cwnd += 1*MSS ("inflating" cwnd)
5. if a new ACK arrives: cwnd = ssthresh (value in step 1) ("deflating" cwnd), and exit fast recovery
Remember: if the sender times out, ssthresh = send_win/2 and cwnd = 1 (that is, go back to slow start again!)
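Steps 1–5 can be sketched as three small handlers. This is a sketch with illustrative names (not real kernel code); units are MSS, and `send_win` stands for min(rcv_win, cwnd) at the moment the third dupACK arrives.

```python
# Sketch of Reno fast retransmit / fast recovery, following steps 1-5 above.

def enter_fast_recovery(send_win):
    """Steps 1-3: halve ssthresh, (retransmit the lost segment), inflate cwnd."""
    ssthresh = send_win // 2
    cwnd = ssthresh + 3            # the 3 dupACKs mean 3 segments left the network
    return cwnd, ssthresh

def on_dup_ack(cwnd):
    """Step 4: each further dupACK inflates cwnd by one MSS."""
    return cwnd + 1

def on_new_ack(ssthresh):
    """Step 5: a new ACK deflates cwnd back to ssthresh; exit fast recovery."""
    return ssthresh

cwnd, ssthresh = enter_fast_recovery(16)   # e.g. send_win = 16 MSS
print(cwnd, ssthresh)                      # 11 8
cwnd = on_dup_ack(cwnd)                    # one more dupACK inflates to 12
cwnd = on_new_ack(ssthresh)                # a new ACK deflates back to 8
print(cwnd)                                # 8
```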
Fast Re-transmit & Fast Recovery
TCP New Reno implements:
  slow start
  congestion avoidance
  fast retransmit & modified fast recovery
Motivation: fast recovery (as in Reno) cannot efficiently recover from multiple losses within the same window.
Modified Fast Recovery
[Figure: TCP Reno with multiple losses (X) within the same window — after one fast retransmit, the remaining losses can only be recovered by a timeout.]
[Figure: NewReno with the same losses — each partial ACK triggers another retransmission, so recovery continues without a timeout.]
Modifications to fast recovery
Partial ACKs (i.e., ACKs that acknowledge some but not all of the packets that were outstanding at the start of fast recovery) are indications of multiple losses.
If a partial ACK is received, re-transmit the next lost segment immediately (whereas in Reno, partial ACKs take TCP out of fast recovery).
The sender remains in fast recovery until all data outstanding when fast recovery was initiated is acked.
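The partial-ACK rule above can be sketched as a single decision function. This is an illustrative sketch: the names (`recover` for the highest sequence number outstanding when recovery began) and the action strings are assumptions, not NewReno's actual variable names.

```python
# Sketch of NewReno's modified fast recovery: stay in recovery and
# retransmit on each partial ACK until everything outstanding at the
# start of recovery has been acknowledged.

def newreno_on_ack(ack, recover, in_recovery):
    """Return (action, still_in_recovery)."""
    if not in_recovery:
        return "normal", False
    if ack < recover:
        # Partial ACK: some, but not all, outstanding data is acked;
        # retransmit the next lost segment immediately.
        return "retransmit", True
    # Full ACK: all data outstanding at the start of recovery is acked.
    return "exit_recovery", False

# Three segments lost out of a window whose highest outstanding byte is 100:
print(newreno_on_ack(40, 100, True))    # ('retransmit', True)
print(newreno_on_ack(70, 100, True))    # ('retransmit', True)
print(newreno_on_ack(100, 100, True))   # ('exit_recovery', False)
```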
Explicit Congestion Notification (ECN)
Current congestion indication:
  uses packet drops to indicate congestion
  sources infer congestion implicitly from timeouts or triple duplicate ACKs
ECN [IETF RFC 2481, 1999]: fewer packet drops and better performance
  uses packet marking rather than dropping
  reduces long timeouts and retransmissions
Needs cooperation between sources and the network:
  sources must indicate that they are ECN-capable
  sources and receivers must agree to use ECN
  the receiver must inform sources of ECN marks
  sources must react to marks just as they do to losses
ECN - 2
Needs additional flags in the TCP header and IP header
In the IP header: ECT and CE
  ECN Capable Transport (ECT): set by sources on all packets to indicate ECN capability
  Congestion Experienced (CE): set by routers as a congestion marking (instead of dropping)
In the TCP header: ECE and CWR
  Echo Congestion Experienced (ECE): when a receiver sees CE, it sets ECE on all packets until CWR is received
  Congestion Window Reduced (CWR): set by a source to indicate that ECE was received and the window size was adjusted (reduced)
ECN - 3
[Figure: source–router–destination exchange — (1) the source sends with ECT=1, CE=0 in the IP header and CWR=0 in the TCP header; (2) a congested router sets CE=1 instead of dropping; (3) the destination returns an ACK with ECN-Echo set in the TCP header; (4) the source reduces its window and sends with CWR=1.]
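The four-step exchange can be walked through with a toy model. This uses plain flag dictionaries rather than real IP/TCP headers, and the function names are illustrative; only the flag semantics (ECT/CE in IP, ECE/CWR in TCP) follow the slides.

```python
# Toy walk-through of the ECN exchange: ECT/CE live in the IP header,
# ECE/CWR in the TCP header.

def source_send(cwr=0):
    return {"ip": {"ECT": 1, "CE": 0}, "tcp": {"CWR": cwr}}

def router_forward(pkt, congested):
    if congested and pkt["ip"]["ECT"]:
        pkt["ip"]["CE"] = 1                 # mark instead of dropping
    return pkt

def receiver_ack(pkt, seen_ce):
    seen_ce = seen_ce or pkt["ip"]["CE"] == 1
    if pkt["tcp"]["CWR"]:                   # source has reduced its window
        seen_ce = False                     # stop echoing congestion
    return {"tcp": {"ECE": 1 if seen_ce else 0}}, seen_ce

pkt = router_forward(source_send(), congested=True)        # steps 1-2
ack, seen_ce = receiver_ack(pkt, False)                    # step 3
print(ack["tcp"]["ECE"])                                   # 1: source must cut cwnd
pkt2 = router_forward(source_send(cwr=1), congested=False) # step 4: CWR set
ack2, seen_ce = receiver_ack(pkt2, seen_ce)
print(ack2["tcp"]["ECE"])                                  # 0 after CWR received
```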
Active Queue Management (AQM) - 1
Performance degradation under current TCP congestion control:
  multiple packet losses
  low link utilization
  congestion collapse
The role of the router becomes important:
  control congestion effectively in the network
  allocate bandwidth fairly
AQM - 2
Problems with the current router algorithm:
  uses FIFO-based tail-drop (TD) queue management
  two drawbacks of TD: lock-out and full-queue
    Lock-out: a small number of flows monopolize the buffer capacity
    Full-queue: the buffer is always full (high queueing delay)
Possible solution: AQM
  Definition: a group of FIFO-based queue management mechanisms that support end-to-end congestion control in the Internet
AQM - 3
Goals of AQM:
  reducing the average queue length: decreasing end-to-end delay
  reducing packet losses: more efficient resource allocation
Methods:
  drop packets before the buffer becomes full
  use the (exponentially weighted) average queue length as a congestion indicator
Examples: RED, BLUE, ARED, SRED, FRED, …
RED-Introduction
Main idea: provide congestion control at the router for TCP flows.
RED algorithm goals:
  The primary goal is to provide congestion avoidance by controlling the average queue size such that the router stays in a region of low delay and high throughput.
  To avoid global synchronization (e.g., in Tahoe TCP).
  To control misbehaving users (this is from a fairness context).
  To seek a mechanism that is not biased against bursty traffic.
RED-Definitions
congestion avoidance – when impending congestion is indicated, take action to avoid congestion.
incipient congestion – congestion that is beginning to be apparent.
The router needs to notify connections of congestion, either by marking the packet [ECN] or by dropping it. {This assumes a drop is an implied signal to the source host.}
RED-Previous Work
  Drop Tail
  Random Drop
  Early Random Drop
  Source Quench messages
  DECbit scheme
RED-Drop Tail Router
• FIFO queueing mechanism that drops packets when the queue overflows.
• Introduces global synchronization when packets are dropped from several connections.
RED-Random Drop Router
• When a packet arrives and the queue is full, randomly choose a packet from the queue to drop.
RED-Early Random Drop Router
• If the queue length exceeds a drop level, then the router drops each arriving packet with a fixed drop probability.
• Reduces global synchronization
• Does not control misbehaving users (UDP)
RED-Source Quench messages
Router sends source quench messages back to source before queue reaches capacity.
Complex solution that gets router involved in end-to-end protocol.
RED-DECbit scheme
Uses a congestion-indication bit in packet header to provide feedback about congestion.
Average queue length is calculated for last (busy + idle) period plus current busy period.
When average queue length exceeds one, set congestion-indicator bit in arriving packet’s header.
If at least half of packets in source’s last window have the bit set, decrease the congestion window exponentially.
RED Algorithm
for each packet arrival:
    calculate the average queue size avg
    if minth <= avg < maxth:
        calculate the probability pa
        with probability pa: mark the arriving packet
    else if maxth <= avg:
        mark the arriving packet
RED drop probability (pa)
  pb = maxp x (avg - minth) / (maxth - minth)   [1]
  pa = pb / (1 - count x pb)                    [2]
Note: this calculation assumes the queue size is measured in packets. If the queue is in bytes, we add [1.a] between [1] and [2]:
  pb = pb x PacketSize/MaxPacketSize            [1.a]
avg – average queue length
  avg = (1 - wq) x avg + wq x q
where q is the newly measured queue length.
This exponentially weighted moving average is designed so that short-term increases in queue size from bursty traffic or transient congestion do not significantly increase the average queue size.
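The EWMA and equations [1]–[2] can be combined into a short sketch. Parameter values follow the slides' suggestions (wq = 0.002, maxp = 0.1); the minth/maxth values and function names are illustrative assumptions.

```python
# Sketch of RED's marking decision: EWMA of the queue length plus the
# marking probability from equations [1] and [2].

wq, minth, maxth, maxp = 0.002, 5.0, 15.0, 0.1

def update_avg(avg, q):
    """Exponentially weighted moving average of the queue length."""
    return (1 - wq) * avg + wq * q

def mark_probability(avg, count):
    """count = unmarked packets arrived since the last marked packet."""
    if avg < minth:
        return 0.0
    if avg >= maxth:
        return 1.0
    pb = maxp * (avg - minth) / (maxth - minth)   # [1]
    return pb / max(1e-9, 1 - count * pb)         # [2], guarded against /0

print(mark_probability(4.0, 0))        # 0.0  (below minth: never mark)
print(mark_probability(20.0, 0))       # 1.0  (above maxth: always mark)
print(mark_probability(10.0, 0))       # 0.05 (halfway: pb = maxp/2)
print(mark_probability(10.0, 10))      # 0.1  (pa grows with count: uniform spread)
```

Note how [2] raises the marking probability as `count` grows, which is exactly the uniform-random-variable method discussed below: the longer the run of unmarked packets, the more likely the next one is marked.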
RED/ECN Router Mechanism
[Figure: dropping/marking probability vs. average queue length — 0 below Minth, rising linearly to maxp at Maxth, then 1 (every packet marked/dropped) beyond Maxth.]
RED parameter settings
wq: suggest 0.001 <= wq <= 0.0042; the authors use wq = 0.002 for simulations
minth, maxth: depend on the desired average queue size
  for bursty traffic, increase minth to maintain link utilization
  maxth depends on the maximum average delay allowed
RED is most effective when the average queue size is larger than the typical increase in the calculated queue size over one round-trip time.
"Parameter-setting rule of thumb": maxth at least twice minth. However, maxth = 3 times minth is used in some of the experiments shown.
Packet-marking probability
Goal: uniformly spread out the marked packets. This reduces global synchronization.
Method 1: geometric random variable – each packet is marked with probability pb
Method 2: uniform random variable – marking probability is pb / (1 - count x pb), where count is the number of unmarked packets that have arrived since the last marked packet
[Figure: marked-packet patterns for method 1 (geometric, p = 0.02) vs. method 2 (uniform).]
Result: marked packets are more clustered under method 1; uniform is better at eliminating "bursty drops".
Setting maxp
"RED performs best when the packet-marking probability changes fairly slowly as the average queue size changes." This is a stability argument: the claim is that RED with a small maxp will reduce oscillations in avg and in the actual marking probability.
The authors recommend that maxp never be greater than 0.1. {This is not a robust recommendation.}
Evaluation of RED meeting design goals
congestion avoidance
  If RED drops packets, this guarantees that the calculated average queue size does not exceed the max threshold. If wq is set properly, RED controls the actual average queue size.
  If RED marks packets, the router relies on source cooperation to control the average queue size.
Evaluation of RED meeting design goals
appropriate time scales
  the detection time scale roughly matches the time scale of the response to congestion
  RED does not notify connections during transient congestion at the router
Evaluation of RED meeting design goals
no global synchronization
  avoids global synchronization by marking at as low a rate as possible, with the distribution spread out
simplicity
  detailed argument about how to implement cheaply in terms of adds and shifts
  {Historically, this argument has been strongly refuted because RED has too many parameters to be robust.}
Evaluation of RED meeting design goals
maximizing global power
  power is defined as the ratio of throughput to delay
fairness
  the authors claim fairness is not well defined
  {This is an obvious side-step of the issue.}
  [Later this becomes a big deal – see the FRED paper.]
Conclusions
RED is an effective mechanism for congestion avoidance at the router, in cooperation with TCP.
Claim: the probability that RED chooses a particular connection to notify during congestion is roughly proportional to that connection's share of the bandwidth.
BLUE
Concept: avoid the drawbacks of RED
  the parameter tuning problem
  fluctuation of the actual queue length
Decouples congestion control from queue length
  uses only loss and link-idle events as indicators
  maintains a single drop probability, pm
Drawback: cannot avoid some degree of multiple packet loss and/or low utilization
BLUE’s Algorithm (I)
Upon a packet loss (or Qlen > L) event:
  if (now - last_update) > freeze_time then
    pm = pm + d1
    last_update = now
Upon a link idle event:
  if (now - last_update) > freeze_time then
    pm = pm - d2
    last_update = now
BLUE’s Algorithm (II)
Update trigger events:
  packet loss – increase the dropping probability
  link idle – decrease the dropping probability
Parameters:
  freeze_time: the update frequency; could be randomized to avoid global synchronization
BLUE’s Algorithm (III)
Parameters:
  d1: increment step
  d2: decrement step
  d1 is significantly larger than d2 – back off more aggressively
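BLUE's update rule is small enough to sketch directly. Note that a link-idle event decreases pm (pm = pm - d2), matching the trigger-event list above. The parameter values (freeze_time, d1, d2) are illustrative assumptions, not the paper's recommendations.

```python
# Sketch of BLUE's single-probability update: increase pm on loss,
# decrease on link idle, rate-limited by freeze_time.

freeze_time = 0.1          # seconds between allowed updates (illustrative)
d1, d2 = 0.02, 0.002       # d1 significantly larger than d2

def blue_update(pm, last_update, now, event):
    """event is 'loss' (or Qlen > L) or 'idle'; returns (pm, last_update)."""
    if now - last_update <= freeze_time:
        return pm, last_update              # too soon: skip the update
    if event == "loss":
        pm = min(1.0, pm + d1)              # congested: back off aggressively
    elif event == "idle":
        pm = max(0.0, pm - d2)              # link underused: drop less
    return pm, now

pm, last = 0.0, 0.0
pm, last = blue_update(pm, last, 0.2, "loss")
print(round(pm, 3))                         # 0.02
pm, last = blue_update(pm, last, 0.25, "loss")   # within freeze_time: unchanged
print(round(pm, 3))                         # 0.02
pm, last = blue_update(pm, last, 0.5, "idle")
print(round(pm, 3))                         # 0.018
```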
Discussions concerning AQM
Problems with existing AQM proposals:
  mismatch between the macroscopic and microscopic behavior of the queue length
  insensitivity to changes in the input traffic load
  configuration (parameter-setting) problems
Reasons:
  queue-length averaging
  use of an inappropriate congestion indicator
  use of an inappropriate control function