How many 9s are enough?

By Scott Bradner

Network World, 08/24/98

If there is one thing telephone people are insistent on, it is reliability -
at least in their demands on equipment. The common belief among
phone people who are trying to build data networks is that equipment
needs to be "five nines" (99.999%) reliable to be useful in the
networks they want to build. I think they are wrong to want this level
of reliability in data networking equipment, and I fear their insistence
on this level is inhibiting their deployment of useful data networks.

It will be interesting to see if they maintain this belief in the
future, when they will have to compete against other providers for
customers. It is currently easy for the traditional phone company to
insist on reliability at great cost because it exists in a world where
increased cost means increased revenue, as authorized by the local
utility commissions.

But utility-commission-distorted economics aside, I think the problem
is that the people who are insisting on five nines do not understand
data networking.

Back in 1964, Paul Baran, then at Rand Corp., produced a series of
articles proposing the idea of packet-switching nets. (The papers were
recently posted online.) Baran was working at a time when there was
considerable worry about the destruction of the U.S. communications
infrastructure by enemy action. He proposed a network design that
would survive large-scale node or link destruction. His design was
for a distributed network with many small cheap packet switches and
many redundant links between them instead of the then-common
network design that had a few large phone circuit switches. He
showed that when reliability was measured end to end, a distributed
net would exhibit very high reliability even in the face of the failure of
a number of the switches or links in the network. He concluded,
"From the user's viewpoint, the system appears to be virtually noise-
and error-free when handling data." He was describing the current
Internet architecture long before its time.

A key reason to use a distributed network is to minimize the reliance
on any single network component. The network will route around
link or switch failures. In this type of environment, five nines
reliability is overkill. But it's not a surprise phone types think in terms
of the need for extreme reliability - they generally don't have
distributed networks with redundant paths.
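
The arithmetic behind that claim can be sketched in a few lines of
Python. This is a rough illustration, not a model of any real network:
the 99.9% switch figure and the three-switch path are made-up numbers,
the function names are mine, and the calculation assumes that failures
are independent and that the net reroutes instantly with capacity to
spare.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def series(*availabilities):
    """Availability of a chain in which every component must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of independent alternatives where any one suffices."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


def downtime_minutes_per_year(availability):
    """Expected minutes of outage per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical figures, chosen only for illustration.
five_nines = 0.99999
single_path = series(0.999, 0.999, 0.999)       # three "three nines" switches in a row
two_paths = parallel(single_path, single_path)  # two such paths, either one will do

for label, value in [("five nines box", five_nines),
                     ("one path", single_path),
                     ("two redundant paths", two_paths)]:
    print(f"{label}: {value:.6%} available, "
          f"{downtime_minutes_per_year(value):.1f} min/year down")
```

Under those assumptions, a single path of three-nines switches falls
well short of five nines, but two such paths side by side slightly
exceed it. Correlated failures, slow rerouting or a shared access link
would all make the real number worse.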

There are places in many ISP networks where redundancy is not as
rich as it might be - the link to the customer site for example. And in
many ISPs, the level of traffic is such that routing around a failure
will cause congestion and data loss. But Internet-style networks are
not the same as telephone-style ones, and the reliability demanded
from each component should not have to be as high because, in most
cases, the net's redundancy will cover for an unreliable component.
Less expensive, reasonably reliable switches may not result in less
reliable service to the customer.

Disclaimer: Harvard spends more time understanding the reliability of
people than electronic components, so the above postulation is mine.