O'Reilly Book Excerpts: BGP

Traffic Engineering: Queuing, Traffic Shaping, and Policing

Related Reading

Building Reliable Networks with the Border Gateway Protocol
By Iljitsch van Beijnum

by Iljitsch van Beijnum

Editor's Note: In the fifth and final installment in this series of excerpts on Traffic Engineering from O'Reilly's BGP, learn how to increase performance for certain protocols or sessions using special queuing strategies, traffic shaping, and rate limiting.

Queuing, Traffic Shaping, and Policing

Traffic engineering works only if you have bandwidth to spare on one of your connections. Even the most sophisticated traffic balancing techniques won't help you when there is just too much traffic. When the output queues for interfaces start filling up, interactive protocols start noticing delays, and bulk protocols start noticing lower throughput. The best way to handle this would be to get more bandwidth, but with some smart queuing techniques, it's possible to increase performance for some protocols or sessions without hurting others very much, or simply to give way to "important" packets and let less important traffic suffer. There are three ways to accomplish this: special queuing strategies, traffic shaping, and rate limiting. Before choosing one, you should know how each interacts with TCP.

Nearly all applications that run over the Internet use the Transmission Control Protocol (TCP, RFC 793) "on top of" IP. IP can only transmit packets of a limited size, and the packets may arrive corrupted by bit errors on the communications medium, in the wrong order, or not at all. Also, IP provides no way for applications to address a specific program running on the destination host. All this missing functionality is implemented in TCP. The characteristics of TCP are:

"Stream" interface: Any and all bytes the application writes to the stream come out in the same order at the application running on the remote host. There is no packet size limit: TCP breaks up the communication into packets as needed.

Integrity and reliability: TCP performs a checksum calculation over every segment (packet) and throws away the segment if the checksum fails. It keeps resending packets until the data is received (and acknowledged) successfully by the other end, or until it becomes apparent that the communications channel is unusable, and the connection times out.

Multiplexing: TCP implements "ports" to multiplex different communication streams between two hosts, so applications can address a specific application running on the remote host. For instance, web servers usually live on port 80. When a web browser contacts a server, it also selects a source port number so that the web page can be sent back to this port, and the page will end up with the right browser process. Well-known server ports are usually (but not always) below 1024; client source ports are semirandomly selected from a range starting at 1024 or higher.

Congestion control: Finally, TCP provides congestion control: it makes sure it doesn't waste resources by sending more traffic than the network can successfully carry to the remote host.

Most of what TCP does falls outside the scope of this book, so it won't be discussed here.[2] It's good to know about the congestion control mechanisms TCP employs, however, because they have a strong impact on the traffic patterns on the network.

TCP Congestion Control

Apart from the basic self-timing that happens because TCP uses a windowing system where only a limited amount of data may be in transit at any time, there are four additional congestion-related mechanisms in TCP: slow start, congestion avoidance, fast retransmit, and fast recovery. These algorithms are documented in RFC 2001.

Slow start

When a TCP connection is initiated, the other side tells the local TCP how much data it's prepared to buffer. This is the "advertised window." Setting up a connection takes three packets: an initial packet with the SYN control bit set (a "SYN packet"), a reply from the target host with both the SYN and ACK bits set, and a final packet from the initiating host back to the target acknowledging the SYN/ACK packet. This is the three-way handshake.

After the three-way handshake, the local (and remote) TCP may transmit data until the advertised window is full. Then it has to wait for an acknowledgment (ACK) for some of this data before it can continue transmitting. When the remote TCP advertises a large window, the local TCP doesn't send a full window's worth of data at once: there may be a low-bandwidth connection somewhere in the path between the two hosts, and the router that terminates this connection may be unable to buffer such a large amount of data until it can traverse the slow connection. Thus, the sending TCP uses a congestion window in addition to the advertised window. The congestion window is initialized as one maximum segment size, and it grows by one segment each time an ACK is received, so it doubles roughly every round trip. If the segment size is 1460 bytes (which corresponds to a 1500-byte Ethernet packet minus IP and TCP headers), and the receiver advertises an 8192-byte window, the sending TCP initializes the congestion window to 1460 bytes, transmits the first packet, and waits for an ACK. When the first ACK is received, the congestion window is increased to 2920 bytes, and two packets are transmitted. When the first one of these is ACKed, the congestion window becomes 4380 bytes, so three packets may now be in transit. One packet is still unacknowledged, so two new packets are transmitted. The window keeps growing by one segment per ACK; once it increases beyond the advertised window, it's the advertised window that limits the amount of unacknowledged data allowed to be underway.
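The growth of the send window during slow start can be sketched per round trip. This is a simplified model, not a TCP implementation: it assumes every segment is acknowledged, ignores delayed ACKs and loss, and the function name is made up for illustration. The segment size and advertised window are the ones from the example above.

```python
MSS = 1460          # bytes per segment, as in the example above
ADVERTISED = 8192   # receiver's advertised window in bytes


def slow_start_windows(mss=MSS, advertised=ADVERTISED):
    """Return the usable send window at the start of each round trip."""
    cwnd = mss
    windows = []
    while cwnd < advertised:
        windows.append(cwnd)
        cwnd *= 2  # the window doubles once a full window has been ACKed
    # From here on the advertised window is the limiting factor.
    windows.append(min(cwnd, advertised))
    return windows


print(slow_start_windows())  # [1460, 2920, 5840, 8192]
```

With these numbers it takes only three round trips before the advertised window, rather than the congestion window, becomes the limit.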

Congestion avoidance

Congestion avoidance introduces another variable: the slow start threshold size (ssthresh). When a connection is initialized, the ssthresh is set to 65,535 bytes (the maximum possible advertised window). As long as no data is lost, the slow start algorithm is used until the congestion window reaches its full size. If TCP receives an out-of-order ACK, however, congestion avoidance comes into play. An out-of-order ACK, also called a duplicate ACK, is an acknowledgment for data that was already acknowledged before. This happens when a packet gets lost: the receiving TCP sends an ACK for the data up to the lost packet, indicating, "I'm still waiting for the data following what I'm ACKing now." TCP ACKs are cumulative: it isn't possible to say "I got bytes 1000-1499, but I'm missing 500-999."

Upon receiving a duplicate ACK, the sending TCP assumes the unacknowledged data has been lost because of congestion, and the ssthresh and also the congestion window are set to half of the current window size, as long as this is at least two times the maximum segment size. After this, the congestion window is allowed to grow only very slowly, to avoid immediate return of the congestion. If the sending TCP doesn't see any ACKs at all for some period of time, it assumes massive congestion and triggers slow start, in addition to lowering the ssthresh. So as long as the congestion window is smaller than or equal to the ssthresh, slow start is executed (congestion window roughly doubles every round trip), and after that congestion avoidance (congestion window grows slowly).
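The two reactions described above, to a duplicate ACK and to a timeout, come down to a few lines of arithmetic. The following is a minimal sketch under the rules as stated in the text; the function names are hypothetical, and the floor of two segments is applied with max().

```python
MSS = 1460  # maximum segment size in bytes (value from the slow start example)


def on_duplicate_ack(window, mss=MSS):
    """Halve ssthresh and cwnd, but never go below two segments."""
    ssthresh = max(window // 2, 2 * mss)
    cwnd = ssthresh
    return cwnd, ssthresh


def on_timeout(window, mss=MSS):
    """No ACKs at all: lower ssthresh and restart from one segment."""
    ssthresh = max(window // 2, 2 * mss)
    cwnd = mss
    return cwnd, ssthresh


def phase(cwnd, ssthresh):
    """Which algorithm governs window growth right now."""
    return "slow start" if cwnd <= ssthresh else "congestion avoidance"


print(on_duplicate_ack(8192))   # (4096, 4096)
print(on_timeout(8192))         # (1460, 4096)
print(phase(1460, 4096))        # slow start
```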

Fast retransmit and fast recovery

When TCP receives three out-of-order ACKs in a row, it assumes that just a single packet was lost. (One or two out-of-order ACKs are likely to be the result of packet reordering on the network.) It then retransmits the packet it thinks has been lost, without waiting for the regular retransmit timer to expire. The ssthresh is set as per congestion avoidance, but the congestion window is set to the ssthresh plus three maximum segments: this is the amount of data that was successfully received by the other end, as indicated by the out-of-order ACKs. The result is that TCP slows down a bit, but not too much, because there is obviously still a reasonable amount of data coming through.
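A sender can detect this condition simply by counting repeated ACK numbers. The sketch below is illustrative only: the function name is hypothetical, the ACK stream is a plain list of ACK numbers, and the window arithmetic follows the description above.

```python
def fast_retransmit(acks, window, mss=1460):
    """Scan ACK numbers in arrival order.

    On the third duplicate in a row, return a tuple of
    (byte offset to retransmit from, new cwnd, new ssthresh).
    Return None if that never happens.
    """
    last_ack = None
    duplicates = 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == 3:
                ssthresh = max(window // 2, 2 * mss)
                cwnd = ssthresh + 3 * mss  # three segments did get through
                return ack, cwnd, ssthresh
        else:
            last_ack = ack
            duplicates = 0
    return None


# The segment starting at byte 2920 is lost; later segments trigger duplicates.
print(fast_retransmit([1460, 2920, 2920, 2920, 2920], window=8192))
# (2920, 8476, 4096)
```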

TCP Under Packet Loss and Delay Conditions

The result of these four mechanisms is that TCP slows down a lot when multiple packets are lost. The problem is even worse when the round-trip times are long, because the use of windows limits TCP's throughput to one window per round-trip time. This means that even with the maximum window size of just under 64 KB (without the TCP high-performance extensions enabled), TCP performance over a transcontinental circuit with a round-trip delay of 70 ms will not exceed roughly 900 KB per second (about 7.5 Mbps). When a packet is lost, this speed is nearly halved, and it takes hundreds of successfully acknowledged packets to get back up to the original window size. So even sporadic packet loss can bring down the effectively used bandwidth for a single TCP session over a high-delay path. This means that packet loss can be tolerated only on low-delay connections, and only as long as those connections are not part of a high-delay path.
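The window-per-round-trip limit is easy to check with a quick calculation. This small sketch uses the 65,535-byte maximum window and the 70 ms round-trip time from the paragraph above; the function name is made up for the example.

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on TCP throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds


bps = max_throughput_bps(65535, 0.070)
print(f"{bps / 8 / 1000:.0f} KB/s, {bps / 1e6:.2f} Mbps")  # 936 KB/s, 7.49 Mbps
```

Halving the window after a loss halves this ceiling as well, which is why a single lost packet hurts so much more on a long path.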

The behavior of the two main categories of non-TCP applications under packet loss conditions is different. These categories are multimedia (streaming audio and video) and applications based on small transactions that don't need a lot of overhead, such as DNS. Streaming audio and video are generally not too sensitive to packet loss, although the audio/video quality will suffer slightly. For things like DNS lookups, packet loss slows down individual transactions a lot (they time out and have to be repeated), but the performance penalty doesn't carry over to transactions that didn't lose packets themselves. Because non-TCP applications don't really react to packet loss, they often exacerbate the congestion by continuing to send more traffic than the connection can handle.

Although some lost packets are the result of bit errors on the physical medium or temporary routing inconsistencies, the typical reason packets are lost is congestion: too much traffic. If a router has a single OC-3 (155 Mbps) connection to a popular destination, and 200 Mbps of traffic comes in for this destination, something has to give. The first thing the router will do is to put packets that can't be transmitted immediately in a queue. IP traffic tends to have a lot of bursts: traffic can get high for short periods of time ranging from a fraction of a second to a few seconds. The queue helps smooth out these bursts, at the expense of some additional delay for the queued packets, but at least they're not lost. If the excessive traffic volume persists, the queue fills up. The router has no other choice than to discard any additional packets that come in when the queue is full. This is called a "tail drop." The TCP anti-congestion measures are designed to avoid exactly this situation, so in most cases, all the TCP sessions will slow down so the congestion clears up for the most part. If the congestion is bad, however, this may not be enough. If a connection is used for many short-lived TCP sessions (such as web or email traffic), the sheer number of initial packets (when TCP is still in slow start) may be enough to cause congestion. Non-TCP applications can also easily cause congestion because they lack TCP's sophisticated congestion-avoidance techniques.

Queuing


Queuing happens only when the interface is busy. As long as the interface is idle, packets will be transmitted without special treatment. Regular queues invariably employ the first in, first out (FIFO) principle: the packet that has been waiting the longest is transmitted first. When the queue is full, and additional packets come in, tail drops happen. More sophisticated queuing mechanisms usually employ several queues. Packets are classified by user-configurable means and then placed in the appropriate queue. Then, when the interface is ready to transmit, the queuing algorithm selects the queue from which the next packet will be transmitted. Cisco routers support several queuing strategies: FIFO, WFQ, RED, priority, and custom queuing. Note that special queuing mechanisms have an effect only when it's not possible to transmit a packet over the output interface immediately. If the interface is idle and there are no queued packets, the new packet is transmitted immediately.
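A FIFO queue with tail drops can be modeled in a few lines. This is only a sketch of the principle, not of how any router implements it; the class name is hypothetical, and the 40-packet default limit is borrowed from the default output queue size mentioned in the RED discussion.

```python
from collections import deque


class FifoQueue:
    """First in, first out, with tail drop when the queue is full."""

    def __init__(self, limit=40):
        self.limit = limit
        self.packets = deque()
        self.tail_drops = 0

    def enqueue(self, packet):
        """Queue a packet; return False if it had to be dropped."""
        if len(self.packets) >= self.limit:
            self.tail_drops += 1
            return False
        self.packets.append(packet)
        return True

    def dequeue(self):
        """Return the packet that has waited longest, or None if empty."""
        return self.packets.popleft() if self.packets else None


queue = FifoQueue(limit=2)
print([queue.enqueue(p) for p in ("a", "b", "c")])  # [True, True, False]
print(queue.dequeue(), queue.tail_drops)            # a 1
```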

First in, first out

FIFO queuing is the most basic queuing strategy: packets are transmitted in the same order they come in. This is the default for fast interfaces. FIFO queuing is enabled by removing all other queuing mechanisms:

interface Serial0
 no fair-queue

Weighted fair queuing

WFQ tries to allocate bandwidth fairly to different conversations (typically TCP sessions) so high-bandwidth sessions don't get to monopolize the connection. WFQ is the default for lower-bandwidth interfaces. It can be enabled with:

interface Serial0
 fair-queue

Random early detect

RED starts to drop packets as the output queue fills up, in order to trigger congestion-avoidance in TCP. The sessions with the most traffic are most likely to experience a dropped packet, so those are the ones that slow down the most. Weighted random early detect (WRED) takes the priority value in the IP header into account and starts dropping low-priority packets earlier than their higher-priority counterparts. Unlike WFQ, priority, and custom queuing, RED doesn't need much processing time and can be used on high-speed interfaces. It needs a transmit queue bigger than the default 40-packet queue to be able to start dropping packets early and avoid tail drops.

interface Ethernet0
 random-detect
 hold-queue 200 out

TIP: In RFC 2309, the IETF recommends using RED for Internet routers.
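The idea behind RED can be sketched as a drop probability that rises with the average queue length. The linear ramp between two thresholds follows the classic RED design; the threshold values, the maximum probability, and the function names below are assumptions picked for illustration, not the defaults of any router.

```python
import random


def red_drop_probability(avg_queue, min_th=50, max_th=150, max_p=0.1):
    """Drop probability for a given average queue length in packets."""
    if avg_queue < min_th:
        return 0.0   # short queue: never drop
    if avg_queue >= max_th:
        return 1.0   # queue too long: drop everything, like a tail drop
    # In between, the probability rises linearly from 0 to max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def red_should_drop(avg_queue, rng=random.random, **thresholds):
    """Decide randomly whether to drop the arriving packet."""
    return rng() < red_drop_probability(avg_queue, **thresholds)


for length in (10, 100, 200):
    print(length, red_drop_probability(length))
```

Because every arriving packet faces the same probability, a session that sends more packets is proportionally more likely to lose one, which is the effect described above.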

Priority queuing

This queuing strategy allows traffic to be classified as high, medium, normal, or low priority. If there is any high-priority traffic, it's transmitted first, then medium-priority traffic, and so on. This can slow down lower-priority traffic a lot or even completely block it if there is enough higher-priority traffic to fill the entire bandwidth capacity. Example 6-18 enables priority queuing and assigns a medium (higher than normal) priority to DNS traffic and a low priority to FTP.

Example 6-18: Enabling priority queuing

interface Serial0
 priority-group 1
priority-list 1 protocol ip medium udp domain
priority-list 1 protocol ip low tcp ftp
priority-list 1 protocol ip low tcp ftp-data
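The scheduling behavior that Example 6-18 configures can be modeled roughly as follows. This is a sketch of strict priority scheduling in general, not of Cisco's implementation: the class name is hypothetical, and packets are reduced to (protocol, destination port) pairs, with port 53 for DNS and ports 20 and 21 for FTP.

```python
from collections import deque

PRIORITIES = ("high", "medium", "normal", "low")


class PriorityQueuing:
    """Always transmit from the highest-priority queue that has packets."""

    def __init__(self):
        self.queues = {name: deque() for name in PRIORITIES}

    def classify(self, packet):
        protocol, port = packet
        if protocol == "udp" and port == 53:        # DNS
            return "medium"
        if protocol == "tcp" and port in (20, 21):  # ftp-data and ftp
            return "low"
        return "normal"

    def enqueue(self, packet):
        self.queues[self.classify(packet)].append(packet)

    def dequeue(self):
        for name in PRIORITIES:
            if self.queues[name]:
                return self.queues[name].popleft()
        return None


pq = PriorityQueuing()
for packet in (("tcp", 21), ("tcp", 80), ("udp", 53)):
    pq.enqueue(packet)
print([pq.dequeue() for _ in range(3)])
# [('udp', 53), ('tcp', 80), ('tcp', 21)]
```

The FTP packet arrived first but leaves last, and it would never leave at all if DNS and web packets kept arriving fast enough.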

Custom queuing

Custom queuing has a large number of queues and transmits a configurable amount of data from a queue before proceeding to the next. This queuing strategy makes it possible to guarantee a minimum amount of bandwidth for certain traffic types, while at the same time making the bandwidth that is left unused available to other traffic types. Example 6-19 assigns 75% of the bandwidth to WWW traffic, 5% to the DNS, and 20% to all other traffic.

Example 6-19: Enabling custom queuing

interface Serial0
 custom-queue-list 1
queue-list 1 protocol ip 1 tcp www
queue-list 1 protocol ip 2 udp domain
queue-list 1 default 3
queue-list 1 queue 1 byte-count 7500
queue-list 1 queue 2 byte-count 500
queue-list 1 queue 3 byte-count 2000

If there is more WWW traffic than can fit in 75% of the interface bandwidth, and the non-WWW/non-DNS traffic requires only 5%, the unused 15% is reallocated to WWW traffic so that no bandwidth is wasted.
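The round-robin behavior behind these percentages can be sketched like this. It is a simplified model with a hypothetical function name: each queue holds packet sizes in bytes, and a queue is served until its byte count is reached or it runs empty. The byte counts are the ones from Example 6-19.

```python
from collections import deque


def custom_queue_round(queues, byte_counts):
    """Serve each queue once, in order; return bytes sent per queue."""
    sent = [0] * len(queues)
    for i, (queue, quota) in enumerate(zip(queues, byte_counts)):
        # Keep sending from this queue until its byte count is used up.
        while queue and sent[i] < quota:
            sent[i] += queue.popleft()
    return sent


byte_counts = [7500, 500, 2000]  # WWW, DNS, everything else

# All three queues are backlogged with 500-byte packets.
busy = [deque([500] * 100) for _ in range(3)]
print(custom_queue_round(busy, byte_counts))   # [7500, 500, 2000]

# The default queue holds just one packet, so its turn ends early
# and the scheduler gets back to the WWW queue sooner.
light = [deque([500] * 100), deque([500] * 100), deque([500])]
print(custom_queue_round(light, byte_counts))  # [7500, 500, 500]
```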

Traffic Shaping and Rate Limiting

With traffic shaping, all the traffic for an interface, or just that matching a certain access list, is counted. This happens regardless of whether the interface is idle or packets are queued for transmission. When the traffic reaches a user-configurable bandwidth threshold, additional packets are put in a queue and delayed, so bandwidth use is limited to the configured amount.

Rate limiting, sometimes referred to as traffic policing or committed access rate (CAR), is similar to traffic shaping, but instead of being delayed, the excess traffic is treated differently from regular traffic in a user-configurable way. A common way to handle the excess traffic is simply to drop it, but it's also possible to do other things, such as lowering the priority field in the IP header. Example 6-20 enables traffic shaping for one interface and rate limiting for another.

Example 6-20: Enabling traffic shaping and rate limiting

interface Serial0
 traffic-shape rate 128000 8000 8000 1000
interface Serial1
 rate-limit output 128000 8000 8000 conform-action transmit exceed-action drop

Both the traffic-shape rate and the rate-limit output commands take the bandwidth limit, in bits per second, as their first argument. The other figures are burst and buffer sizes. For most applications, allowing bursts isn't desirable (TCP performance is even a bit worse when there is room for bursts), so for traffic shaping, you can leave those figures out; for rate limiting, you can set them to the minimum of 8000.
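The difference between the two approaches shows up clearly in a small simulation. The sketch below is a simplified model, not Cisco's algorithm: the policer is a plain token bucket that drops what doesn't fit, the shaper paces packets at the configured rate without any burst allowance, and the function names are hypothetical. The 128000 bps rate and 8000-byte burst come from Example 6-20.

```python
def police(packets, rate_bps, burst_bytes):
    """Rate limiting: packets are (arrival_time, size_bytes) tuples.

    Returns one action per packet, either 'transmit' or 'drop'.
    """
    tokens = burst_bytes
    last = 0.0
    actions = []
    for arrival, size in packets:
        # Refill the bucket for the time that has passed, up to the burst size.
        tokens = min(burst_bytes, tokens + (arrival - last) * rate_bps / 8)
        last = arrival
        if size <= tokens:
            tokens -= size
            actions.append("transmit")
        else:
            actions.append("drop")
    return actions


def shape(packets, rate_bps):
    """Traffic shaping: return each packet's departure time in seconds."""
    next_free = 0.0
    departures = []
    for arrival, size in packets:
        start = max(arrival, next_free)         # wait for earlier packets
        next_free = start + size * 8 / rate_bps
        departures.append(start)
    return departures


burst = [(0.0, 1000)] * 10  # ten 1000-byte packets arriving at once
print(police(burst, 128000, 8000).count("drop"))  # 2
print(shape(burst, 128000)[-1])                   # 0.5625
```

The policer loses two of the ten packets, while the shaper delivers all of them, the last one a little over half a second late.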

Traffic shaping and rate limiting are often used to limit a customer's available bandwidth when a customer buys a certain amount of bandwidth that is lower than that of the interface that connects them. This isn't a good use of rate limiting, however, because it drops a lot of packets, which makes TCP think there is congestion. So it slows down, but after a while it tries to pick up the pace again, and then there is more packet loss, and so on. Traffic shaping, on the other hand, just slows the packets down so TCP adapts to the available bandwidth. Example 6-21 shows the FTP performance over a connection that is rate-limited to 128 Kbps.

Example 6-21: FTP over a 128-Kbps rate-limited connection

ftp> put testfile
local: testfile remote: testfile
150 Opening BINARY mode data connection for 'testfile'.
100% |**********************************| 373 KB  00:00 ETA
226 Transfer complete.
382332 bytes sent in 35.61 seconds (10.48 KB/s)

The TCP performance is only 84 Kbps, about two thirds of the available bandwidth. Example 6-22 is the same transfer over the same connection, but now with traffic shaping to 128 Kbps in effect.

Example 6-22: FTP over a 128-Kbps traffic-shaped connection

ftp> put testfile
local: testfile remote: testfile
150 Opening BINARY mode data connection for 'testfile'.
100% |**********************************| 373 KB  00:00 ETA
226 Transfer complete.
382332 bytes sent in 24.73 seconds (15.10 KB/s)

The performance is now 121 Kbps, which is just a few percent under the maximum possible bandwidth, considering TCP, IP, and datalink overhead.
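The Kbps figures can be reproduced from the FTP output in Examples 6-21 and 6-22. The small calculation below assumes, as the numbers suggest, that the FTP client's KB is 1024 bytes and that the rate is then multiplied by 8; the function name is made up for the example.

```python
def ftp_rate_kbps(byte_count, seconds):
    """Convert an FTP transfer summary to kilobits per second."""
    kilobytes_per_second = byte_count / seconds / 1024
    return kilobytes_per_second * 8


for label, seconds in (("rate-limited", 35.61), ("traffic-shaped", 24.73)):
    rate = ftp_rate_kbps(382332, seconds)
    print(f"{label}: {rate:.0f} Kbps, {rate / 128:.0%} of 128 Kbps")
# rate-limited: 84 Kbps, 66% of 128 Kbps
# traffic-shaped: 121 Kbps, 94% of 128 Kbps
```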

Apart from combating denial-of-service attacks, as discussed in Chapter 11, rate limiting has another potential use, because unlike traffic shaping and the different queuing mechanisms, it can also be applied to incoming traffic. When an ISP and a customer agree on a certain bandwidth use, the ISP can easily use traffic shaping to make sure the customer doesn't receive more incoming traffic than the agreed-upon bandwidth. But since it's impossible to traffic shape packets coming in on an interface, the customer is responsible for traffic shaping their outgoing traffic. To make sure they don't send out more traffic than agreed, the ISP can implement additional rate limiting for incoming traffic.

Iljitsch van Beijnum has been working with BGP in ISP and end-user networks since 1996.

2. W. Richard Stevens' book TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley) has an excellent description of TCP internals.

Copyright © 2009 O'Reilly Media, Inc.