Merit Network

Discussion Communities: Merit Network Email List Archives

Merit Joint Technical Staff

major dial-in outage 12/3

  • From: William Bulley
  • Date: Fri Dec 06 10:50:47 1996

There was a several-hour dial-in outage on Tuesday, December 3rd, which
affected at least the Ann Arbor huntgroups and possibly others including
some lines in East Lansing.

The exact cause of this outage is still under investigation.  The problems
started when some remote LAS machines failed to respond to large numbers of
incoming authentication requests.  We have some reason to suspect that, at
least in some cases, this was due to other services downstream from those
LAS machines.  In the case of the University of Michigan LAS, these services
include Kerberos, ABS (an accounting system) and PTS (an administrative
grouping system).

There are several authentication helpers (servers) statewide; however, only
three of these were affected: one in Ann Arbor, which handles the main UofM
huntgroups; one in East Lansing, which services many MSU huntgroups; and a
backup helper in Ann Arbor, which was simply overloaded and unable to help
when its help was most needed.

We have recently instituted new operational procedures to offload dial-in
logfiles (which accumulate on these helpers) to a centralized statistics
facility in Ann Arbor.  These procedures turned out to be more CPU-intensive
than we had intended and may have contributed to the above problems.

To avoid a similar problem happening in the future, we plan to reduce the
time we will wait for a response to an authentication request from a remote
LAS.  In addition, if the response never comes from a remote LAS, the request
will be rejected with an appropriate message indicating the problem.
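
As an illustration only, the timeout-and-reject behavior described above
might look like the following minimal Python sketch.  It is not Merit's
actual server code: the names authenticate, query_las, fast_las and
slow_las, the 0.2 second timeout and the message texts are all hypothetical,
since the message gives no concrete values.

```python
import concurrent.futures
import time

# Hypothetical timeout; the original message states no actual value.
LAS_TIMEOUT_SECONDS = 0.2

# Shared worker pool so a slow remote call does not block the caller.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def authenticate(user, query_las, timeout=LAS_TIMEOUT_SECONDS):
    """Ask a remote LAS about `user`; return (accepted, message).

    `query_las` is a stand-in callable for the remote lookup.  If it does
    not answer within `timeout` seconds, the request is rejected with a
    message that names the problem instead of waiting indefinitely.
    """
    future = _pool.submit(query_las, user)
    try:
        accepted = future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only effective if the call has not started yet
        return False, "rejected: remote LAS did not respond within %g s" % timeout
    except Exception as exc:
        return False, "rejected: remote LAS error: %s" % exc
    if accepted:
        return True, "accepted"
    return False, "rejected: remote LAS denied the request"


def fast_las(user):
    """Stand-in for a healthy LAS that answers immediately."""
    return user == "alice"


def slow_las(user):
    """Stand-in for an overloaded LAS that answers too late."""
    time.sleep(0.3)
    return True


if __name__ == "__main__":
    print(authenticate("alice", fast_las))
    print(authenticate("alice", slow_las, timeout=0.05))
```

The point of the design is that the caller's wait is bounded by the timeout,
so a stalled LAS produces a prompt, explained rejection rather than a growing
queue of pending requests.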

We have changed the logfile offloading procedures to use less CPU time,
and have altered their frequency and the manner in which they are
invoked.  We believe these steps will limit the queuing backlogs we saw
this past Tuesday.  We will monitor the servers' performance closely
to be sure the above steps provide the improvements we expect.

In the longer term, we are investigating increasing the hardware and disk
capacity on the core helpers to handle the ever-increasing load placed on
them.

Regards,

web...

-- 
William Bulley, N8NXN              Senior Systems Research Programmer
Merit Network, Inc.                Email: web@merit.edu
4251 Plymouth Road, Suite C        Phone: (313) 764-9993
Ann Arbor, Michigan  48105-2785    Fax:   (313) 647-3185

[ What's all the fuss over the end of the century with mission critical ]
[ programs failing due to dates?  If people simply started using Roman  ]
[ Numerals the problem vanishes!  MCM = 1900 MCMXCIX = 1999 MM = 2000   ]
