Merit Network

Discussion Communities: Merit Network Email List Archives

North American Network Operators Group


RE: Followup British Telecom outage reason

  • From: Sean Donelan
  • Date: Mon Nov 26 06:29:40 2001



On Mon, 26 Nov 2001, Christian Kuhtz wrote:
> Now, if lack of infrastructure reliability can harm human life you may feel
> differently, but that isn't the case for most of us at the present time.

I've designed software and networks used for public safety and
emergencies.  And yes, people have died on my watch. It is a somewhat
different mindset, but not that different.  A lot of "good engineering
practice" applies to any engineering activity, including software
engineering.

It's not even a matter of cost.  A typical hospital spends less on
their emergency power system than an Internet/telco hotel.  The major
difference is the hospital staff knows (more or less) what to do when
the generators don't work.

The big secret is most "life safety" systems fail regularly.  Most of
the time it doesn't matter because the "big one" doesn't coincide with
the failure.


> Faults will happen.  And nothing matters as much as how you prepare for
> when they do.

Mean Time To Repair is a bigger contributor to Availability calculations
than the Mean Time To Failure.  It would be great if things never failed.
But some people are making their systems so complicated, chasing the Holy
Grail of 100% uptime, that they can't figure out what happened when one
does fail.
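
To put rough numbers on the availability point, here is a minimal sketch
using the standard steady-state formula, availability = MTTF / (MTTF + MTTR).
The function names and the figures (1000 hours between failures, 10 hours
to repair) are illustrative assumptions, not measurements from any real
system.

```python
HOURS_PER_YEAR = 8760  # non-leap year, used only for the downtime estimate


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical baseline: fails every 1000 hours, takes 10 hours to repair.
baseline = availability(1000, 10)
# Option A: same failure rate, but repairs take half as long.
faster_repair = availability(1000, 5)
# Option B: same repair time, but failures are half as frequent.
fewer_failures = availability(2000, 10)

for label, a in [
    ("baseline", baseline),
    ("halve MTTR", faster_repair),
    ("double MTTF", fewer_failures),
]:
    print(f"{label:12s} availability={a:.6f} "
          f"downtime/yr={downtime_hours_per_year(a):.1f} h")
```

Under these assumptions, halving the repair time buys the same availability
as doubling the time between failures.  Which of the two is cheaper to
achieve depends on the system, but better diagnostics and practiced repair
procedures are often within reach when a redesign is not.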

Murphy's revenge: The more reliable you make a system, the longer it will
take you to figure out what's wrong when it breaks.





