It seems like only yesterday, but it’s been 20 years now since a simple bug in a C program brought down much of AT&T’s long distance network and brought on a national phreak hunt. It was January 15, 1990: a day I’ll never forget because it was my 25th birthday, and the outage made for a rough work day. But, in retrospect, it offers a great story, full of important lessons.
The first lesson was realized quickly and is perhaps best summed up by Occam’s razor: the simplest explanation is often the most likely one. The outage wasn’t the result of a mass conspiracy by phone phreaks, but of a recent code change: Signaling System 7 software upgrades to the 4ESS switches.
There are obvious lessons to be learned about testing. Automated unit testing was largely unknown back then, and it could be argued that this wouldn’t have happened had today’s unit testing best practices been in place.
The outage also taught us a lot about code complexity and factoring, since a break statement could be so easily misplaced in such a long, cumbersome function. The subsequent 1991 outage caused by a misplaced curly brace in the same system provided yet another reminder.
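To make the break pitfall concrete, here is a minimal C sketch. It is not the 4ESS code: the function handle_message, the message types, and the buffer_empty flag are all hypothetical names invented for illustration. It shows only the language rule at issue, which is that a break inside an if does not leave the if; it leaves the enclosing switch.

```c
/* Hypothetical message types; illustrative only, not taken from any real switch code. */
enum msg_type { MSG_STATUS, MSG_OTHER };

/*
 * Hypothetical handler. Returns 1 if the bookkeeping step ran, 0 if it was skipped.
 */
int handle_message(enum msg_type type, int buffer_empty)
{
    int bookkeeping_done = 0;

    switch (type) {
    case MSG_STATUS:
        if (buffer_empty) {
            /* The author may have meant "leave the if and carry on",
               but in C this break exits the whole switch. */
            break;
        } else {
            /* ... drain the buffer here ... */
        }
        /* Meant to run in both cases; skipped whenever buffer_empty is nonzero. */
        bookkeeping_done = 1;
        break;

    case MSG_OTHER:
        bookkeeping_done = 1;
        break;
    }

    return bookkeeping_done;
}
```

Under these assumptions, handle_message(MSG_STATUS, 1) returns 0 because the bookkeeping line is never reached, while handle_message(MSG_STATUS, 0) returns 1. Pulling the case body out into its own small function, and asserting on its result in a unit test, would make the skipped step much harder to miss.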
Finally, this massive chain reaction of failures reminded us of the systemic risk from so many interconnected systems. That situation hasn’t improved; rather, it’s much worse today. We don’t call it the internet and the web for nothing.
I was reminded of 1990 when Google recently deployed Buzz: yet another player in our tangled web of feeds and aggregators. These things are convenient; for example, I rarely log in to Plaxo, but it looks as though I’m active there because I have it update automatically from this blog and other feeds. It makes me wonder whether someone could set off a feedback loop with one of these chains. Aggregators can easily prevent this (some simple pattern matching will do), but there may be someone out there trying to make it happen. After all, there’s probably a misplaced curly brace in all that Java code.
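As a rough idea of what that simple pattern matching might look like, here is a sketch in C. Everything in it is an assumption made for illustration: the marker text, the function names already_seen and republish, and the notion that an aggregator tags what it republishes. I have no idea how any real service does it. The aggregator appends its own marker to each item it passes along, and refuses any incoming item that already carries that marker.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical marker this aggregator appends to everything it republishes. */
#define OWN_MARKER "[via example-aggregator]"

/* Returns 1 if the item already carries our marker, i.e. it has been through here before. */
int already_seen(const char *item)
{
    return strstr(item, OWN_MARKER) != NULL;
}

/*
 * Writes the item plus our marker into out.
 * Returns 0 on success, -1 if the item is rejected as a loop
 * or if out is too small to hold the result.
 */
int republish(const char *item, char *out, size_t out_size)
{
    int n;

    if (already_seen(item))
        return -1; /* break the feedback loop */

    n = snprintf(out, out_size, "%s %s", item, OWN_MARKER);
    if (n < 0 || (size_t)n >= out_size)
        return -1; /* formatting error or truncated output */

    return 0;
}
```

Republishing "New blog post" succeeds and yields the text with the marker appended; feeding that result back in is rejected. A marker is easy to strip or forge, so this only guards against accidental loops, not against someone trying to cause one.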
Bruce Sterling’s book, The Hacker Crackdown, provides an interesting account of the 1990 failure, and he has put it in the public domain on the MIT web site. If you want a quick partial read, I recommend Part I.