North American Network Operators Group|
FWD: Explanation for the recent major downtime
- From: jc dill
- Date: Thu Sep 15 10:58:53 2005
My personal website is hosted with DreamHost. They sent this out to
their customers today. Of interest to NANOG is the bit about the N+1
redundant genset system losing two generators in quick succession,
which in turn forced the UPS to shut down and the entire data center
to go dark. Something to consider in your data center design...
DreamHost Announcement Team wrote:
On Monday, September 12 the greater Los Angeles area experienced a major power outage
affecting large sections of the city, including our main data center. The power outage began
shortly before 1pm PDT and continued until about 4:30pm PDT. Our data center is equipped
with a redundant backup power system with both battery UPS systems and diesel generators,
but the backup failed and our entire data center was powered down.
We have previously covered much of this information on our official weblog (http://
blog.dreamhost.com/) but many of you have not seen that information so we will summarize
the events here.
When the grid power to our building was cut, the UPS system kicked in and kept everything in
the building up and running. The five generators also fired up and began providing power.
The building needs four generators to operate at full power so the system is designed to
tolerate a single failure. Unfortunately, two of the five generators failed within minutes of each
other. We receive our power from the building that houses our data center, which also
manages the redundant power system. We do not know the exact reason for the generator
failures at this time; the explanations we have received so far have been vague and
unsatisfactory.
Regardless, the remaining three generators were not sufficient to meet the building's power
needs and that caused the emergency electrical systems to transfer into a “load shedding
mode” and the building’s UPS system to turn itself off, thus preventing permanent UPS and
related equipment damage. That shut everything down, including emergency lighting, and the
building was evacuated.
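The capacity arithmetic behind the N+1 design described above can be sketched as follows. This is a hypothetical illustration only: the per-generator rating and building load figures are assumed round numbers chosen so that four generators exactly carry the load, not figures from DreamHost or its building.

```python
# Hypothetical sketch of N+1 generator redundancy: five gensets installed,
# four needed for full building load, so the design tolerates one failure.
# All numbers are illustrative assumptions, not the building's real ratings.

GENSET_CAPACITY_MW = 4.0   # assumed rating of each generator
BUILDING_LOAD_MW = 16.0    # assumed full building load (needs 4 gensets)
TOTAL_GENSETS = 5          # N = 4 required, plus 1 spare

def can_carry_load(failed: int) -> bool:
    """Return True if the surviving generators can still carry the building."""
    surviving = TOTAL_GENSETS - failed
    return surviving * GENSET_CAPACITY_MW >= BUILDING_LOAD_MW

print(can_carry_load(1))  # True: 4 surviving gensets, 16 MW >= 16 MW
print(can_carry_load(2))  # False: 3 surviving gensets, 12 MW < 16 MW
```

With two simultaneous failures the margin is gone, which is why the remaining three generators could not hold the load and the system dropped into load shedding.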
About 15 minutes later, one of the generators was started up to power emergency lighting and
a couple of our senior technicians made their way into the (still evacuated) building and down
to our data center to assess the damage. Since the backup power had failed, our own data
center power remained off until the main grid power came back. We then proceeded to slowly
power up our equipment. Servers (and all computers) consume significantly more power when
booting up than when up and running so there is some risk of overloading the power circuits if
too many of them are flipped on at once. Keeping that in mind, we powered everything on as
quickly as we safely could. At that point the majority of our services were fully back up and
running, but some were still down, so we began systematically verifying all services and
making any necessary repairs and adjustments. Whenever a large number of servers suddenly
lose power, a small percentage of them will not come back up properly, and with several
hundred servers it takes a while to verify all of them.
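The staged power-up described above can be sketched as bringing servers back in small batches, pausing between batches so boot-time inrush never overloads a circuit. This is an assumed illustration: the batch size, settle delay, and the `power_on` stand-in (in practice a PDU API, IPMI command, or a human flipping switches) are all hypothetical.

```python
# Hypothetical sketch of a staged data-center power-up: servers come back
# batch_size at a time, with a settle delay between batches, because machines
# draw significantly more power while booting than in steady state.

import time

def power_on(server: str) -> None:
    # Placeholder for the real out-of-band power-on mechanism.
    print(f"powering on {server}")

def staged_power_up(servers, batch_size=10, settle_seconds=30):
    """Power servers on batch_size at a time; return the batches processed."""
    batches = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        batches.append(batch)
        for server in batch:
            power_on(server)
        if i + batch_size < len(servers):
            time.sleep(settle_seconds)  # let inrush subside before the next batch
    return batches

staged_power_up([f"web{n}" for n in range(25)], batch_size=10, settle_seconds=0)
```

The trade-off is exactly the one the announcement describes: a larger batch restores service faster but raises the peak inrush load on each circuit.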
Once our own access to our servers was restored our staff continued working into the night to
restore as much service as possible and to respond to as many of your support cases as
possible. Some of our staff continued working all the way through the night and we were able
to restore almost everything that first night.
Tuesday (September 13) started off early with all of us addressing the residual issues. At
around noon that day one of our core routers experienced an internal failure stemming from
damage previously sustained during the power outage. Our routers handle all of the Internet
traffic coming in and out of our network and they are set up in a redundant way to minimize
network disruption when a failure does occur. In this case, the router's main CPU (called
the 'supervisor') died and the secondary one took over. Everything continued working almost
as it should have, but there is a remaining router issue that we are still working on with
Cisco support. That issue is responsible for the slower-than-normal performance of our
network, and it will be resolved as soon as possible.
During this outage, our off-network Emergency Status Page (http://status.dreamhost.com/)
proved to be an invaluable resource for disseminating information among our customers. That
status page remained up throughout the power outage and was updated regularly as we
received new information. Unfortunately, not everyone knows about it and we will be working
to improve that situation in the coming days. Those bloggers among you who did know to
check the status page were extra helpful in passing along the information to other
dreamhosters who were still in the dark. Thank you to everyone who helped out with that!
This announcement will be followed by another explaining what went wrong with our processes
and what we plan to do to address those problems. That will come in the next few days.
We will be continuing to provide more detailed information on our official weblog, found
here: http://blog.dreamhost.com/
Also, everyone who has not bookmarked our Emergency Status Page should do so now. That
page is found here: http://status.dreamhost.com/
We will be improving on the basic page we have there to make it as useful an information
channel as possible.
If you have any additional questions about this outage, please let us know. We will be happy to
address all of your questions or concerns.
The Un-Happy DreamHost Powerless Team