Monday, January 10, 2011

Something New to Fear: Bufferbloat

Something today directed me to Jim Gettys' recent posts on what he calls "bufferbloat," which he believes is a pervasive and mostly undiagnosed problem that leaves our broadband networks performing much more poorly than their original design and specifications would suggest. This is pretty technical, and I barely understand it myself, but let me try to explain, if for no other reason than to test my understanding.

When you access a web page over the internet (for example), your computer and the web server hosting the site you're requesting are exchanging "packets" of data. The original Brainiacs who designed the internet already knew that network congestion would be a problem, so they built into the system a feedback mechanism: the rate at which your computer and the web server spew packets at each other adjusts based on the speed of the connection and the number of packets lost somewhere in the network.

Basically, if the network is slow, both computers will slow down the rate at which they flood packets into the system, to help prevent a massive traffic jam. If it is a fast connection, they'll speed up to maximize throughput.
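To make that feedback loop concrete, here's a minimal sketch of the additive-increase/multiplicative-decrease (AIMD) idea behind TCP's congestion control. The numbers and function names are mine, purely for illustration:

```python
# A minimal sketch of the additive-increase / multiplicative-decrease (AIMD)
# idea behind TCP congestion control. All numbers are illustrative.

def next_window(cwnd, loss_detected):
    """Return the sender's next congestion window, in packets."""
    if loss_detected:
        return max(1.0, cwnd / 2)  # loss means congestion: back off hard
    return cwnd + 1.0              # no loss: probe gently for more bandwidth

cwnd = 1.0
for rtt, loss in enumerate([False, False, False, False, True, False, False]):
    cwnd = next_window(cwnd, loss)
    print(f"round trip {rtt}: window = {cwnd:.0f} packets")
```

The key point is that a lost packet is the signal the sender uses to slow down -- which, as we'll see, is exactly the signal that big buffers delay.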

The problem is buffers. Memory has gotten cheaper and more plentiful much, much more quickly than bandwidth, so there has been a temptation to add memory buffers at almost every point in this system. The basic idea is this: let's say you've got a steady stream of packets going between two computers at a given rate. Then, there's a hiccup and the network stops for a 10th of a second. What you'd like to have happen is that while the network is stopped, the flow of packets is stored up in a buffer, and then once the flow returns to normal, you start pushing the packets out of the buffer, and ideally, catch back up. Buffers smooth out the flow of data. It's a holding tank. Nothing too sinister.
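As a toy illustration of that "holding tank" behavior, here's a sketch of a queue absorbing a brief stall and then draining back down. The arrival rate, link rate, and one-tick stall are made-up numbers:

```python
from collections import deque

# Toy model of a buffer smoothing out a brief network stall (numbers made up).
buffer = deque()
arrival_rate = 10    # packets arriving from the computer per tick
link_rate = 12       # packets the outgoing link can carry per tick
stalled_ticks = {3}  # the link hiccups and sends nothing during tick 3

for tick in range(10):
    buffer.extend([tick] * arrival_rate)              # new packets queue up
    capacity = 0 if tick in stalled_ticks else link_rate
    for _ in range(min(capacity, len(buffer))):
        buffer.popleft()                              # link drains the queue
    print(f"tick {tick}: {len(buffer)} packets waiting")
```

Used that way, a modest buffer is harmless: the queue briefly grows during the hiccup and then empties again.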

The problem is, everything has a buffer now. Your operating system, your ethernet card, your wireless access point, your ISP's routers, etc., etc., and a lot of them are pretty big, because memory is cheap. So your computer starts sending out a stream of packets, and they're going really fast, and your computer thinks "Awesome, we've got a super-fast connection," and pushes out a ton of packets.

Unfortunately, it has really just filled up a bunch of buffers -- and the buffer in your home router/access point is likely a particular offender. So, for the sake of simplicity, just imagine that your computer has pumped ten million data packets into your access point as fast as it could for five seconds, thinking they were all flying toward another distant computer, when they were really just piling up in the access point's buffer.
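The trouble is what a big buffer in front of a slow uplink does to latency. Here's a quick back-of-the-envelope calculation; the buffer size and uplink speed are assumptions I picked for illustration, not measurements:

```python
# How much delay does a full buffer add in front of a slow uplink?
# The buffer size and link speed below are illustrative assumptions.
buffer_bytes = 256 * 1024           # a 256 KB buffer in a home router
uplink_bits_per_second = 1_000_000  # a 1 Mbit/s residential uplink

added_delay = (buffer_bytes * 8) / uplink_bits_per_second
print(f"A full buffer adds roughly {added_delay:.1f} seconds of queuing delay")
```

Every packet behind that queue -- including the acknowledgements and loss reports your computer is waiting for -- sits in line for those extra seconds.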

Meanwhile, the access point is pushing them out into the ISP's network (cable, FIOS, whatever) at a slower rate. And to make things worse, it turns out that there are some other problems further down the line, and some packets are dropped. So then the router lets the computer know that there was a problem with packet 10,273. And the computer says "WTF! You could have told me that before I sent you the other 10 million packets!" So then this mess has to be straightened out, things get out of order, blah blah, it is bad.
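In other words, the loss report itself gets stuck behind the queue, so the sender keeps transmitting at full speed long after it should have backed off. Here's a rough sketch of how many packets go out "blind" before the bad news arrives; again, the rates and delays are illustrative assumptions:

```python
# Roughly how many packets does the sender push out before it hears about
# a loss, when the loss signal waits behind a bloated queue? Illustrative numbers.
send_rate_pps = 1000   # packets per second the sender is emitting
queue_delay_s = 2.0    # extra delay added by the full buffer (see above)
base_rtt_s = 0.05      # round-trip time with no queuing at all

blind_packets = send_rate_pps * (queue_delay_s + base_rtt_s)
print(f"about {blind_packets:.0f} packets sent before the loss signal arrives")
```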

So the fundamental feedback mechanism that was supposed to make the internet degrade smoothly has been inadvertently circumvented. What you get as a result is much "burstier" network performance. Things look like they're going fast, then they stop, then they go fast again. You are likely to see either good performance or horrible performance, with nothing in between. This is bad for everything that depends on computer networks, particularly voice over IP, gaming, and other technologies that need low latency and reliability; and it is even worse on less reliable networks like cellular.
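One crude way to see this on your own connection is to watch latency while the line is idle, and again while something is saturating your uplink (a big file upload, say). Here's a rough probe that times a few TCP connections; the target host, port, and sample count are placeholders I made up:

```python
import socket
import time

# Crude latency probe: time a few TCP handshakes to a test host.
# Run it once with the line idle and once during a big upload; on a
# bloated connection, the numbers typically jump from tens of ms to
# hundreds or thousands. HOST and PORT are placeholder choices.
HOST, PORT, SAMPLES = "example.com", 80, 5

for i in range(SAMPLES):
    start = time.time()
    with socket.create_connection((HOST, PORT), timeout=5):
        pass  # open and immediately close the connection
    print(f"sample {i}: {(time.time() - start) * 1000:.0f} ms")
    time.sleep(1)
```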

The whole thing reminds me of the Y2K hypothesis, that a simple, pervasive problem -- particularly one hidden in device firmware -- could cause unpredictable systemic problems that would be extremely difficult to solve. This one is particularly funny because the most important fix may be to just change the default settings on millions of computers and home routers.

Or, perhaps this is just way over my head, and I have no idea what I'm talking about. Stay tuned.

1 comment:

David Täht said...

Got it in one, with a good set of analogies!

The only thing I'd add to your commentary is that once a network is in congestion collapse, other critical packets can't do their jobs, either.

This includes DNS - adding hundreds of ms of latency to turning a name like your website's into an IP address is not good.

With a typical web page doing dozens, even hundreds of DNS lookups, DNS not getting through in a timely fashion on a congested network results in vastly slower browsing.

NTP - the network time protocol - relies on somewhat timely delivery of packets in order to keep your computer's clock in sync.

ARP - the address resolution protocol - also relies on timely resolution in order to find other devices on your network.

DHCP - if these packets are lost or excessively delayed, machines can't even get on the network in the first place.

Routing - many routing protocols - the stuff that detects failures in the network - can cause even larger problems when intolerably delayed.

(most) VoIP needs about one packet every 10 ms per flow in order to sound good, and less than 30 ms of jitter.

Gamers will get fragged a lot more often with latencies above their twitch factor.

That's just my top 6. At the moment I'm doing tests as to what I can do to improve streaming radio....