Inside the Internet
Neal Rauhauser | February 23rd 2009
Cutting Edge Sci-Tech Writer
[Photo: NANOG members gathering]
On Monday, February 16, a global meltdown of internet routing was narrowly averted, say some informed observers of the information superhighway. Few know the details or understand the dynamics.
Our world today depends on the internet. Email has largely replaced the fax, various forms of instant messaging fill some of the duties once handled by the telephone, and your local bank teller probably doesn't even recognize you at the grocery store thanks to web access to your bank account.
Everyone is acquainted with the misbehaviors at the individual level found within the internet's open, distributed architecture: mailboxes full of spam, phishing attacks aimed at your financial information, and spyware are ubiquitous. Less well known are the vulnerabilities of the systems that transport traffic. Yet this backbone issue weighs on the minds of everyone involved in the operation of the internet's core.
The internet operator has a very different view of the world than that of an end user. The computer you're using to read this article has a default route, most likely to a router that protects your PC from the outside world. The network operator has no such luxury, but instead must have their own autonomous system and use defaultless routing. All large service providers have an autonomous system number for which they pay a $30 yearly fee to the American Registry for Internet Numbers. This is a sort of social security number for internet service providers, uniquely identifying any routing information they publish.
Once a carrier has its autonomous system, it goes about developing connections with other internet providers, a process known as peering. The carrier offers each peer information about the internet address space within its own network, using its autonomous system number to identify its offerings.
On February 16, 2009 at about 11:25 Eastern time, an internet provider in the Czech Republic named Supro made a configuration error. The route to every destination on the internet passes through a series of autonomous systems, and it is very common for providers to pad the information they offer with additional instances of their own number. The effect of this is to make a destination appear further away in terms of network cost, a technique that is used to balance traffic when a provider has two peers with different levels of access to the rest of the world. Path length in normal operation is at least two or three hops and perhaps as many as two dozen. Supro's offerings had some 255 hops, overflowing the software counters in some versions of carrier router software and causing them to reset.
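The padding technique described above can be sketched in a few lines of Python. This is a simplified illustration, not router software: the `Announcement` class, the helper names, and the autonomous system numbers (taken from the private-use range) are all assumptions made for the example, and real route selection weighs more than path length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Announcement:
    prefix: str      # the destination being offered
    as_path: tuple   # AS numbers, nearest neighbor first, origin last


def prepend(ann, asn, times=1):
    """Return a copy of the announcement with `asn` added `times` times at the front."""
    return Announcement(ann.prefix, (asn,) * times + ann.as_path)


def best_route(candidates):
    """Prefer the shortest AS path (only one of the criteria real routers use)."""
    return min(candidates, key=lambda a: len(a.as_path))


# Hypothetical AS numbers, for illustration only.
ORIGIN, PEER_A, PEER_B = 64512, 64600, 64700

offered = Announcement("192.0.2.0/24", (ORIGIN,))

# The origin pads its offer to peer B with three extra copies of its own number.
to_peer_a = offered
to_peer_b = prepend(offered, ORIGIN, times=3)

# Each peer adds its own number once before passing the offer along.
via_a = prepend(to_peer_a, PEER_A)
via_b = prepend(to_peer_b, PEER_B)

chosen = best_route([via_a, via_b])
print(len(via_a.as_path), len(via_b.as_path), chosen.as_path[0])
```

The rest of the world sees a two-hop path through peer A and a five-hop path through peer B, so traffic favors peer A. Padding with a few copies is routine; padding until the path reaches some 255 entries is what exposed the counter problem.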
The resulting dance of peering connections forming, running until they received a Supro announcement, and then resetting caused a global disruption in internet traffic. It wasn't as serious as 1997, when a Kentucky internet service provider, lacking an official autonomous system, chose to use autonomous system number zero. That innocent error knocked out most of the internet for a period of eighteen hours.
There is no Federal Bureau of Internet Service, nor is there a European Center for Internet Stability. The concept of peering applies to the physical links between providers, and it extends to the overall management of the system. When there is trouble in the world of defaultless routing, the members of the North American Network Operators' Group swing into action. The group is known among internet engineers by the intriguing and futuristic acronym NANOG, but it is real.
The response to the February 16 event was swift. Discussion began twenty minutes after the problem was first observed. Twenty-five minutes later, a NANOG specialist posted an example configuration that would filter all Supro announcements. Eighty-five minutes into the event, a Czech member of NANOG reported that he'd contacted the peers of Supro and they pulled their connection until repairs could be made. The global internet thrives on some 285,000 leading destinations managed from almost 31,000 separate autonomous systems. After an event of this magnitude, it takes nearly an hour for stability to return once the problem is removed.
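The configuration posted to the list is not reproduced here, but the idea behind such a filter can be sketched in Python rather than in any router's own syntax. Everything in the sketch is an assumption made for illustration: the `accept` function, the limit of 50 hops, and the AS number 64999 standing in for the misconfigured provider.

```python
def accept(as_path, blocked_asns=frozenset(), max_len=50):
    """Return True if an announcement with this AS path should be accepted.

    Rejects paths longer than `max_len` hops and paths that pass through
    any autonomous system listed in `blocked_asns`.
    """
    if len(as_path) > max_len:
        return False
    return not any(asn in blocked_asns for asn in as_path)


# Hypothetical stand-in for the provider whose announcements are being filtered.
BLOCKED = frozenset({64999})

announcements = {
    "192.0.2.0/24": [64600, 64512],                # ordinary two-hop path
    "198.51.100.0/24": [64700, 64999],             # originates at the blocked AS
    "203.0.113.0/24": [64700] + [64513] * 254,     # 255 hops, far beyond normal
}

kept = sorted(p for p, path in announcements.items() if accept(path, BLOCKED))
print(kept)
```

Only the ordinary two-hop announcement survives. Blocking by AS number addresses one known offender, while the length limit guards against any provider that pads a path far beyond the two dozen hops seen in normal operation.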
Twenty-six hours after the event, Ivan Pepelnjak, a researcher and textbook author, posted a terse article describing the problem, workarounds, and ongoing vulnerabilities. This generated a follow-on discussion that will lead to improved best practices for autonomous system operators. Equipment vendors began to comment, and the staff from providers large enough to have their own quality assurance people provided them with specific testing information from live environments. Twenty-nine hours after the event, a consensus emerged: there are multiple vulnerabilities in every different version of every carrier router manufacturer's software. In other words, the worldwide internet itself is massively at risk, especially should a coordinated attack take place.
The internet's defense against this vulnerability today is mostly based on "reputation economics." No provider large enough to be a peer to others wants to be seen as the source of global email troubles. Further problems may arise from this cluster of bugs during the period of time the equipment vendors require to create, test, and distribute software updates.
But billions of internet users need not lose any sleep over this danger. The men and women of NANOG are always ready.
Neal Rauhauser is an analyst and consultant on energy and telecommunications. He is a member of the Stranded Wind Initiative and can be found at www.strandedwind.org.