Congestion Avoidance and Control

Congestion Avoidance and Control – Jacobson & Karels, 1988


It’s October 1986 and there’s trouble on the internet. A congestion collapse has reduced the bandwidth between LBL and UC Berkeley by a factor of a thousand. These two sites happened to be 400 yards apart. And that drop in capacity? From 32 Kbps to 40 bps! This was just the first in a series of such collapses. Van Jacobson and Michael Karels set out to investigate. Beyond just TCP/IP, the lessons they learned have relevance for any situation where traffic management and flow control are important. For example, they can be applied when managing back-offs and retries with circuit breakers and dimmer switches, and when deciding how quickly to let traffic through again on restoring flow (with a little imagination to consider a window of outstanding requests).

To avoid congestion collapse, the authors propose the principle of the conservation of packets.

By ‘conservation of packets’ we mean that for a connection ‘in equilibrium’, i.e., running stably with a full window of data in transit, the packet flow is what a physicist would call ‘conservative’: A new packet isn’t put into the network until an old packet leaves. The physics of flow predicts that systems with this property should be robust in the face of congestion.

Note that QJump, which we looked at last week, uses exactly this principle.

If this principle were obeyed, congestion collapse would become the exception rather than the rule. Thus congestion control involves finding places that violate conservation and fixing them.

There are only three ways that packets can fail to be conserved:

  1. The connection doesn’t get to equilibrium (typically, the sender floods the receiver on start or restart)
  2. A sender injects a new packet before an old packet has exited
  3. The equilibrium can’t be reached because of resource limits along the path

These can be addressed by, respectively: starting slowly to bring a connection up to equilibrium; conserving equilibrium by managing round-trip-time variability so as to avoid spurious retransmissions; and adapting to resource limits along the path by throttling back sources when the network is congested.

Slow Start

Once a connection is up and running acks can regulate a ‘clock’ that stops packets being sent too quickly for the receiver, since acks can’t be generated any faster than packets can get through the network. But how do you get the data flowing in the first place?

To start the ‘clock’, we developed a slow-start algorithm to gradually increase the amount of data in-transit. Although we flatter ourselves that the design of this algorithm is rather subtle, the implementation is trivial—one new state variable and three lines of code in the sender:

  • Add a congestion window, cwnd, to the per-connection state.
  • When starting or restarting after a loss, set cwnd to one packet.
  • On each ack for new data, increase cwnd by one packet.
  • When sending, send the minimum of receiver’s advertised window and cwnd.
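Putting these four rules together, here is a loose sketch of the sender-side state in Python (illustrative only, not the paper’s BSD kernel code; the class and method names are invented, and window sizes are counted in whole packets for simplicity):

    class SlowStartSender:
        def __init__(self, receiver_window):
            self.rwnd = receiver_window   # receiver's advertised window (packets)
            self.cwnd = 1                 # congestion window: one packet on (re)start

        def on_ack_for_new_data(self):
            # Each ack for new data opens the window by one packet,
            # which doubles the amount in flight every round trip.
            self.cwnd += 1

        def on_timeout(self):
            # Restarting after a loss: fall back to a single packet.
            self.cwnd = 1

        def send_window(self):
            # Never have more data outstanding than either the
            # receiver's window or cwnd allows.
            return min(self.rwnd, self.cwnd)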

Actually, the slow-start window increase isn’t that slow: it takes time R·log₂W where R is the round-trip-time and W is the window size in packets. This means the window opens quickly enough to have a negligible effect on performance, even on links with a large bandwidth–delay product.
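(To put illustrative numbers on that, not taken from the paper: with R = 100 ms and W = 64 packets, the window opens fully after about log₂64 = 6 round trips, i.e. roughly 0.6 seconds.)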

Retransmissions (retries)

A good round trip time estimator, the core of the retransmit timer, is the single most important feature of any protocol implementation that expects to survive heavy load. And it is frequently botched… One mistake is not estimating the variation of the round trip time, R. From queuing theory we know that R and the variation in R increase quickly with load.

See appendix A of the paper for the details of a cheap-to-compute estimator of this variation (prior versions of the TCP specification recommended using a constant value instead), which can then be used to set the retransmit timeout for the next packet sent.
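The estimator keeps an exponentially weighted moving average of both the round trip time and its mean deviation, and sets the timeout from the two. A rough sketch of the idea, in floating point for readability (the paper’s appendix gives an integer-arithmetic version; the 1/8 and 1/4 gains and the 4× deviation multiplier here follow the values that later became standard practice, e.g. RFC 6298):

    class RttEstimator:
        def __init__(self, first_measurement):
            self.srtt = first_measurement         # smoothed round trip time, A
            self.rttvar = first_measurement / 2   # smoothed mean deviation, D

        def update(self, measurement):
            # Error between the new sample and the running average.
            err = measurement - self.srtt
            self.srtt += err / 8                          # gain 1/8 on the mean
            self.rttvar += (abs(err) - self.rttvar) / 4   # gain 1/4 on the deviation

        def retransmit_timeout(self):
            # Timeout for the next packet: mean plus a multiple of the deviation,
            # so the timer adapts as load (and hence RTT variance) grows.
            return self.srtt + 4 * self.rttvar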

Another timer mistake is in the backoff after a retransmit: If a packet has to be retransmitted more than once, how should the retransmits be spaced? For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working—exponential backoff—but a proof of this is beyond the scope of this paper.
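In other words, each successive retransmission of the same packet waits twice as long as the one before it. A minimal illustration (the function name and the 64-second cap are assumptions for the sketch, not taken from the paper):

    def backoff_timeout(rto, retransmit_count, max_rto=64.0):
        # Double the timeout for every retransmission already sent for this
        # packet, capped so the timer cannot grow without bound.
        return min(rto * (2 ** retransmit_count), max_rto)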

It’s all to do with linear system theory apparently:

Linear system theory says that if a system is stable, the stability is exponential. This suggests that an unstable system (a network subject to random load shocks and prone to congestive collapse) can be stabilized by adding some exponential damping (exponential timer backoff) to its primary excitation (senders, traffic sources).

Congestion management

If timeouts are set appropriately as above, then a timeout indicates a lost packet. Packets can be lost because they were damaged in transit (comparatively rare), or because they were dropped due to congestion in the network. We can therefore treat timeouts as a good proxy for network congestion.

The network must be able to signal the transport endpoints that congestion is occurring (or about to occur). And the endpoints must have a policy that decreases utilization if this signal is received and increases utilization if the signal isn’t received.

We can use timeouts as the signal, but what should the policy be?

When the network is congested… the queue lengths will start increasing exponentially. The system will stabilize only if the traffic sources throttle back at least as quickly as the queues are growing.

Load is controlled by adjusting the size of the window. So when congestion is signalled the window is decreased multiplicatively: Wᵢ = d·Wᵢ₋₁, where d is a constant less than one. The first thought is to use a symmetric, multiplicative increase, possibly with a longer time constant, Wᵢ = b·Wᵢ₋₁ where 1 < b < 1/d. This is a mistake. The result will oscillate wildly and, on the average, deliver poor throughput.

The best increase policy is to make small, constant changes to the window size: Wᵢ = Wᵢ₋₁ + u. We end up with the following simple implementation:

  • On any timeout, set cwnd to half the current window size (this is the multiplicative decrease).
  • On each ack for new data, increase cwnd by 1/cwnd (this is the additive increase).
  • When sending, send the minimum of the receiver’s advertised window and cwnd.
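Together with slow start, these rules give the additive-increase / multiplicative-decrease behaviour at the heart of TCP congestion control. A minimal sketch of just these three rules (illustrative only; a real implementation keeps cwnd in bytes and uses a separate slow-start threshold to decide when to switch between slow start and additive increase):

    class AimdSender:
        def __init__(self, receiver_window):
            self.rwnd = receiver_window   # receiver's advertised window (packets)
            self.cwnd = 1.0               # congestion window, in packets

        def on_timeout(self):
            # Multiplicative decrease: halve the window on a congestion signal.
            self.cwnd = max(self.cwnd / 2, 1.0)

        def on_ack_for_new_data(self):
            # Additive increase: roughly one extra packet per round trip
            # (1/cwnd per ack, and about cwnd acks arrive per round trip).
            self.cwnd += 1.0 / self.cwnd

        def send_window(self):
            return min(self.rwnd, int(self.cwnd))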