Monday, January 10, 2011

Congestion, Content Buffering and Complexity

Just as a chain is only as strong as its weakest link, a network connection can only be as fast as its slowest link. An IP network comprises many links and routers, and the path that each packet takes can change during the course of a connection (such as downloading a web page). Depending on congestion and routing, the effective transfer rate can vary greatly over the life of a connection. For many users the weakest link is their ISP access service, which is therefore where congestion is most likely to occur.
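To make the weakest-link point concrete, here is a toy sketch in Python (the link names and rates are made-up numbers, not measurements): the end-to-end rate over a path is bounded by the minimum link rate along it.

    # Toy illustration: end-to-end throughput is bounded by the slowest link.
    # Link rates are hypothetical, in megabits per second.
    path_mbps = {
        "home -> ISP access": 5,
        "ISP -> regional backbone": 1000,
        "backbone -> content host": 10000,
    }

    bottleneck = min(path_mbps.values())
    print(f"Effective transfer rate is at most {bottleneck} Mbps "
          f"(limited by the slowest link)")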
[As an aside, there is also a lot going on under the hood, so to speak, to make internet data communication work, and it too can bear on congestion. ISPs and carriers employ many network design and operations staff, automated and manual network management, and all the hardware and software (and real estate) needed to keep things flowing smoothly. Data transport below the IP layer -- which can include, among other things, DSLAMs, multiplexers, ATM and MPLS -- and applications above the IP layer -- HTTP, SIP, RTSP, etc. -- are unconcerned with all of this network-layer stuff, and so I will ignore it in this article.]
Congestion is a large topic which I will not attempt to cover in this article. What I do want to discuss is one aspect of congestion management: network content caching. This is, in brief, the technique of reducing network congestion by placing content closer to the user. It is accomplished with a cache of files or other popular content to which the network redirects requests as they are received. Sometimes this is done explicitly with mirror sites, which you have likely encountered in the past, and sometimes implicitly, in a manner that is transparent to the user.
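As a rough sketch of how transparent caching behaves (the function names and the dict-backed store here are my own invention, not any particular product's API), the cache serves a local copy when it has one and goes to the origin, once, when it doesn't:

    # Minimal sketch of a transparent content cache. The origin fetch is
    # simulated; a real cache also handles expiry, capacity and consistency.

    cache = {}  # url -> content

    def fetch_from_origin(url):
        # Stand-in for the expensive long-haul transfer.
        print(f"cache miss: fetching {url} from origin")
        return f"<content of {url}>"

    def get(url):
        if url not in cache:
            cache[url] = fetch_from_origin(url)  # long haul, paid once
        else:
            print(f"cache hit: serving {url} locally")
        return cache[url]

    get("http://example.com/movie")  # first request crosses the continent
    get("http://example.com/movie")  # later requests are served nearby

Every request after the first is served from nearby storage, which is exactly the distance saving discussed in the observations below.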

I was reminded of this when I read this article in Ars Technica about buffering. Although the article is generally pretty good, a few things about it bother me, chiefly that it blends together topics such as congestion, buffering, latency and caching as if they were the same, rather than the closely linked but separate items that they are. I don't want to dwell on the article too closely, except to, I hope, add some clarity with the following observations:
  1. Bit-km as the network loading metric: The object of mirror and cache sites is, in large part, to reduce the overall load on the internet across all of its component networks. If we exclude data compression -- the largest downloads are media files, which are already compressed -- we can only reduce network load by reducing the distance between the user and the content. If you are in Ottawa and you want to download a movie, the bit-km is lower if the content is cached in, say, Toronto rather than Los Angeles (a back-of-envelope calculation follows this list). The content is transferred to each cache once, and each local user doesn't tie up transcontinental network capacity.
  2. Congestion has a time-frame: Imagine you are at the supermarket looking for a cashier. If one is free you rush over, avoiding congestion for your transaction. If they're all busy, with other customers queued up at every cashier, you encounter congestion. Come back a few minutes later and you may find there is, again, a free cashier. This is an example of periodically high short-term congestion but low long-term congestion (the toy simulation after this list illustrates the distinction). The grocer's challenge is to engineer an acceptable amount of short-term congestion (long-term congestion is almost always bad) to optimize their economic outcome by balancing their costs and your continued patronage. Networks are similar: there is some tolerance for short-term congestion as long as long-term congestion is kept under control.
When we talk about buffering in general, it is about those line-ups, whether for cashiers or for routing to the next link in a network data path. This is more about the impact on latency, since caches and mirrors don't really affect congestion in the long run. The reason is that network operators follow good engineering economics practice by provisioning only enough capacity to meet demand; caches and mirrors reduce the overall demand and therefore the required amount of equipment and transmission capacity. Similarly, the grocer doesn't like cashiers sitting around doing nothing since it costs money, although as a customer you certainly prefer it that way. However, all customers potentially benefit in both cases since with lower costs there can be lower prices.
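Here is the back-of-envelope calculation promised in item 1. The movie size and distances are rough, illustrative figures (straight-line distances; real routed paths are longer):

    # Bit-km comparison for the Ottawa example in item 1.
    movie_bits = 4 * 8e9        # hypothetical 4 GB movie, in bits
    routes_km = {
        "Ottawa <- Toronto cache": 450,
        "Ottawa <- Los Angeles origin": 3700,
    }

    for route, km in routes_km.items():
        print(f"{route}: {movie_bits * km:.3g} bit-km")

    # The saving repeats for every local user, since the content crosses
    # the continent once (to fill the cache) rather than once per download.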
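And here is the toy simulation promised in item 2. The arrival pattern and service rate are invented for illustration: demand averages 1.5 customers per tick, below the capacity of 2, yet every burst produces a brief line-up.

    # Bursty arrivals at a single cashier (or link) that serves 2 per tick.
    arrivals = [6, 0, 0, 0] * 5   # bursts of 6, then three quiet ticks
    capacity = 2
    queue = 0

    for tick, arriving in enumerate(arrivals):
        queue += arriving
        served = min(queue, capacity)
        queue -= served
        print(f"tick {tick:2d}: arrived {arriving}, "
              f"served {served}, waiting {queue}")

The queue spikes to 4 right after each burst and drains to zero before the next one arrives: periodically high short-term congestion, but no long-term congestion.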

Unfortunately, all this efficiency has its own costs. Grocers have to manage employee numbers and schedules against predicted -- never certain -- customer demand, just as network operators have to manage the choice and placement of caches against predicted demand for that content. That complexity isn't free and therefore must be carefully assessed in every situation. All these systems add complexity to the network and create a need for specialized skills to manage that complexity. This is not only costly but also creates more failure modes. There have to be compelling cost reductions before taking on the risk of going down that path.

Against that complexity is the relative simplicity of adding more network capacity. This is a less risky choice since it means doing more of the same thing: equipment, staff and processes. In addition, the costs are more predictable, if possibly higher. Oftentimes throwing more capacity at the problem of both short-term and long-term congestion is the superior solution, at least until the pain of doing so becomes financially unacceptable.

If you want to come up against this first hand, try to sell a network operator on installing a new type of equipment in their network to solve the congestion problem. Should you succeed, congratulate yourself on achieving a monumentally difficult objective. More often you will fail, but you will rarely be turned away cold; the person you are selling to may know the potential benefit of what you're selling, but will also know the risk (both to them personally and to their employer) of choosing unwisely. It is tempting to instead call up the Cisco sales rep and order a few more blades for the routers that are already running the network just fine.

User-transparent caching (or buffering, if you prefer) sounds good in theory but can be very costly in practice. Beware discussions of this topic that fail to mention complexity, cost versus alternatives, and reliability.
