Internet Routing Meltdown

To listen to the experts running the large backbone networks on the Internet, the entire routing architecture of the Internet is on the verge of collapse. While that might be somewhat hyperbolic, I have enough knowledge and experience in that particular arena to know that their concerns are certainly valid.

Currently, the Border Gateway Protocol (BGP) carries the core routing information across the Internet backbones and even all the way out to many edge networks. To understand why BGP is on the edge of melting down, it is necessary to know something about how IP routing works.

IP routing works using a process colloquially known as hot potato routing. That is, each node (router) on the network receives a packet and immediately lobs it in the direction of the node (router) that is next closest to the destination. If the forwarding node has no idea where to send the packet, it drops it on the floor and maybe notifies the sender. What this means, however, is that every node that performs routing must have a consistent view of the entire Internet topology. Otherwise, packets can end up ping-ponging between nodes in routing loops or suffering other unpleasant fates. The IP specification does not, however, say anything about just how these routing tables are shared.
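
To make the forwarding step concrete, here is a minimal sketch in Python of that per-packet decision: consult a forwarding table, pick the most specific matching prefix, and either hand the packet to the corresponding next hop or drop it. The table entries are invented for illustration; a real router carries hundreds of thousands of them.

    import ipaddress

    # Toy forwarding table mapping prefix -> next hop address.
    FIB = {
        ipaddress.ip_network("10.0.0.0/8"): "192.0.2.1",
        ipaddress.ip_network("10.1.0.0/16"): "192.0.2.2",
    }

    def next_hop(dst):
        """Longest-prefix match: the most specific prefix containing dst wins."""
        addr = ipaddress.ip_address(dst)
        matches = [p for p in FIB if addr in p]
        if not matches:
            return None  # no route: drop the packet and perhaps notify the sender
        return FIB[max(matches, key=lambda p: p.prefixlen)]

    print(next_hop("10.1.2.3"))    # 192.0.2.2 (the /16 beats the /8)
    print(next_hop("172.16.0.1"))  # None (dropped; a default route would catch this)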

The current solution to maintaining a mostly consistent view of the Internet routing table across the network is BGP. BGP has served very well for many years. It assumes the Internet consists of a series of autonomous systems (or ASes), each of which is a single administrative domain and is internally connected. From BGP’s perspective, an AS is a black box. Each address prefix is associated with an origin AS, and BGP uses the AS path (the sequence of ASes that must be traversed to reach the destination) as its primary metric for determining a best path.
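
As a concrete sketch of that metric, here is the AS-path comparison in a few lines of Python. The routes and AS numbers are invented (drawn from the documentation range), and real BGP consults several other attributes (local preference, origin, MED, and so on) around this step; only the path-length comparison described above is shown.

    # Two candidate routes to the same prefix, each carrying the AS path
    # it was advertised along.
    routes = [
        {"as_path": [64500, 64496], "next_hop": "peer-a"},
        {"as_path": [64501, 64502, 64496], "next_hop": "peer-b"},
    ]

    def best_path(candidates):
        # Prefer the shortest AS path; ties are broken by other attributes
        # in real BGP, which this sketch ignores.
        return min(candidates, key=lambda r: len(r["as_path"]))

    print(best_path(routes)["next_hop"])  # peer-a: two AS hops beat three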

There are several problems with the current routing system. The first is the requirement that every router have a complete copy (or at least a consistent subset) of the Internet routing table. This means that as the number of routing prefixes in the global table increases, so does the required size of the forwarding information base (FIB), the table a node actually uses to look up next hop addresses. Contrast that with the routing information base (RIB), which contains all the information about all routes known to a node. Both take up resources, but unlike the RIB, which can live in slower, less predictable memory, the FIB must support fast lookups for a high-end router to move packets at sufficient speed. Thus, as line speeds get faster, the latency requirement for the FIB gets tighter even as the size of the FIB must grow.
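
The RIB/FIB split is easy to picture in code. A minimal sketch with invented routes: the RIB keeps every route learned for a prefix, while the FIB distills that down to one next hop per prefix, because the FIB is the structure that must answer lookups at line rate (held in fast, expensive memory such as TCAM on real hardware).

    # RIB: every route learned for each prefix, from however many
    # neighbors advertised it. Slow, plentiful memory is acceptable here.
    rib = {
        "203.0.113.0/24": [
            {"next_hop": "peer-a", "as_path": [64500, 64499]},
            {"next_hop": "peer-b", "as_path": [64501, 64502, 64499]},
        ],
    }

    # FIB: one winning next hop per prefix, distilled from the RIB. This
    # is the table consulted for every forwarded packet, so it must stay
    # small enough to live in fast memory.
    fib = {
        prefix: min(candidates, key=lambda r: len(r["as_path"]))["next_hop"]
        for prefix, candidates in rib.items()
    }

    print(fib)  # {'203.0.113.0/24': 'peer-a'}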

The other problem is that because the whole network must have a consistent view, any time there is a topology change (because a link fails, for instance), there is a period of time (the convergence window) during which routing will be flaky. Obviously, the shorter the convergence window, the more stable the network. On the other hand, as the number of ASes and prefixes in the network increases, so does the amount of information that must be exchanged to maintain the routing state. Indeed, the larger the overall network, the larger the baseline level of churn in the topology, increasing the number of messages passed around. Also, as the RIB grows, so does the time needed to run through it and update everything. Both of these factors tend to increase the convergence time when the topology changes. Additionally, as more and more ASes maintain more interconnects, the number of paths in the RIB for each of those ASes increases, further exacerbating the problem.

The current system worked very well in the days when the number of ASes was measured in the low hundreds and the number of network prefixes in the thousands. It even scaled fairly well up to one hundred thousand routes in the global table, and tolerably up to double that. But as the route table grows, more and more instability appears, and more and more routers end up running against the edge of their capabilities, or network operators end up spending a lot of money to upgrade them.

This problem is further exacerbated by the fact that IPv4, the IP version under discussion to this point, is rapidly approaching endgame status as its address space runs out. The same routers that are running into trouble handling IPv4 must simultaneously handle IPv6, which means the resources available for the IPv4 tables must be shared with the IPv6 tables as well.

A short aside on IPv6 is useful here. The same problems that exist in IPv4 also exist in IPv6. The current IPv6 routing table is under five thousand routes and is expected to remain in the low to mid five digit range for some time. However, the Internet is growing, and it is almost certain that the IPv6 routing tables will only grow over time, eventually leading back to the same place the network now finds itself with IPv4. For the remainder of this discussion, I will ignore the version differences, since the solution will likely be the same for both protocol versions.

Now, the problem as I see it is that any system that must maintain a globally consistent view of the network in a proactive manner, such as the current BGP routing regime, is doomed to scalability problems when faced with an ever growing network with a constant nontrivial rate of churn. Anything that requires maintaining a state table that grows over time without bound is clearly not sustainable.

It seems obvious that the solution is to be more reactive than proactive. Instead of maintaining a consistent view of the entire network, only learn about the parts of the network you actually need to communicate with. Since there is likely to be a substantial amount of topological locality, this could substantially reduce the load on the FIB in many networks. The RIB, in turn, would only need to track updates for items actually queried over time. Thus, resources for both the FIB and RIB would be reduced. However, this introduces an unavoidable delay when communicating with a destination that is currently unknown, a delay that is not present under the current regime.
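
A minimal sketch of that reactive lookup, with a hypothetical query_upstream() standing in for whatever discovery mechanism such a protocol would define (nothing like it exists today):

    route_cache = {}  # only the destinations this network has actually needed

    def query_upstream(prefix):
        # Placeholder for the on demand discovery the new protocol would
        # provide; assume it returns a next hop after some network round
        # trips, which is exactly the first-use delay discussed above.
        return "discovered-next-hop"

    def resolve(prefix):
        if prefix in route_cache:       # fast path: already learned
            return route_cache[prefix]
        hop = query_upstream(prefix)    # slow path: pay the setup delay once
        route_cache[prefix] = hop       # remember it for later traffic
        return hop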

There will, however, be nodes or networks that communicate widely and will very quickly build up a substantially complete view of the entire Internet. But only those areas that need this information will be required to invest in the resources to handle it. Also, in an on demand environment, it is conceivable that in a large network, multiple discrete devices could split the routing load automatically through careful cooperation.

Changing the routing regime, however, is not something that can be done easily. It has been decades since it was possible to have a flag day when every network operator would cut over from one protocol to another. There are too many separate entities in too many separate countries for this to be even remotely possible even if the benefits of doing so are obvious to all. That means that any new routing protocol regime must be incrementally deployable.

It is conceivable that a new routing regime could be constructed such that the existing BGP protocol could be used as a source of routing information without being introduced into the FIB of the relevant routers. This would allow islands of the network to operate independently of the BGP core while passing information into and out of it at special edge routers that speak both protocols. It would, of course, require that the new on demand protocol maintain the concept of autonomous systems, but that is a concept that seems to work well.
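
A rough sketch of such an edge router, with an invented interface since no such protocol exists: it holds the full BGP table in its RIB, keeps that table out of the forwarding path, and installs FIB entries only when interior routers actually ask about a destination.

    class EdgeRouter:
        """Hypothetical bilingual border device for the hybrid scheme."""

        def __init__(self, bgp_rib):
            self.bgp_rib = bgp_rib  # full BGP table; slow memory is fine here
            self.fib = {}           # populated lazily, so it stays small

        def answer_query(self, prefix):
            # Serve an on demand query from inside the island using the
            # information learned via ordinary BGP from the core.
            route = self.bgp_rib.get(prefix)  # None means unreachable
            if route is not None:
                self.fib[prefix] = route      # install only what is used
            return route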

One advantage to simply replacing BGP with an on demand protocol is that end systems would not need to know about the new routing regime. Other proposals, such as NIMROD, have assumed that the end node should have full knowledge and control of routing through the network. But the end node does not have sufficient knowledge of network conditions to make an informed decision about such things. Also, shifting the routing decision to the end nodes would require a wholesale update to possibly billions of systems rather than just the routers. And by replacing BGP instead of everything, only those routers that speak BGP need to be concerned with the change; networks relying on static routing or simply a default route need not be changed in any way.

It is, however, unclear how well this type of scheme would hold up in core networks that have very high levels of interconnectivity combined with heavy traffic in all sorts of random directions. Those networks will still require high-end equipment with large resources for both the FIB and RIB. However, many smaller networks that participate in the system merely to gain redundancy and resilience, but which have relatively controlled traffic patterns, will benefit immensely from an on demand routing system, as they will be able to survive with much less capable hardware.

One final consideration should be examined. An on demand system that takes the place of BGP would still maintain the hot potato routing paradigm, but each hop along the way would maintain its own FIB and RIB using the same on demand methodology. That means the setup delay for a communication path that has never been used may be fairly high. Once that path is discovered, though, it will likely remain quick as long as it is in use. To prevent the FIB and RIB from growing to the upper bound of the network size, degenerating into the problem we have today, there must be a process by which entries in the FIB are expired. If an entry expires, the setup delay will occur again, so the precise expiry mechanism should be tuned by each site to minimize the churn in important destinations. Of course, as noted earlier, this path setup latency is not present under the existing regime.
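
One plausible expiry mechanism is an idle timeout: entries in active use are continually refreshed and stay hot, while unused ones age out. A sketch, with the timeout value being exactly the kind of per-site tuning knob mentioned above:

    import time

    IDLE_TIMEOUT = 300.0  # seconds; an arbitrary value, tuned per site in practice

    cache = {}  # prefix -> (next_hop, last_used)

    def lookup(prefix):
        entry = cache.get(prefix)
        now = time.monotonic()
        if entry is not None:
            hop, last_used = entry
            if now - last_used < IDLE_TIMEOUT:
                cache[prefix] = (hop, now)  # in use: refresh, entry stays hot
                return hop
            del cache[prefix]               # idle too long: age it out
        return None  # caller must rediscover the route, paying the setup delay again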

Since the current proactive regime is provably not scalable (such proof left as an exercise for the reader), it is clear that some sort of change is required. That change will require some sort of compromise. A regime that maintains a view of the entire network state across the whole network is what allows minimal setup time for connections on newly used paths. Since that structure is not sustainable, a compromise must be made. A short delay during the initial communication with a previously unknown destination seems like a small price to pay for improved network scalability.

Let me finish by saying that I have not done any specific research or testing so it is entirely possible that the scheme I envision does not actually improve matters on the Internet. A lot of very smart people are working on the problem and they have been unable to agree on a solution after many years of wrangling.

As an aside, it seems most of the work on routing protocols is focused exclusively on wireless networks. Let me urge the folks working in that domain to stop declaring that their protocols are intended only for wireless networks. After all, any protocol that works on a notoriously unreliable medium like wireless should also work on more reliable media like wired network connections!
