Sequence of Events – WARP Failure, Dec 30th, 2005




#

Date & Time

Description

1

Fri Dec 30th, time?

Warp A/C power module dies (only piece on warp that is not redundant).

2

Sun Jan 1st, 6:21 pm

Gerry became aware of problem and sent email to itstaff.

3

Sun Jan 1st, 10:57 pm

David Jones sends email:
“I can't ftp or ssh to phas for the past few days. Are any of you having the same problem? I need to get access to my public_html directory and the ~phys533 directory.”

4

Mon Jan 2nd, 9:07 am

David Jones sends email:

“ping to phas fails...”

5

Mon Jan 2nd, ~10 am

Ron emailed DJ:

“physics has died. If you are in the Hennings building, you can go to room 205 and log in to LTS1 using the X-terminals on the left-hand side of the room. That will give you access to the files you need. You could also get access to your ~public_html files via samba. I could also set up a PHYS533 samba share if you need it.”

6

Tues, Jan 3rd

AM: Ron came in and diagnosed the problem, trying a couple of fixes. Determined A/C Input power module had died. All other modules appeared OK. Although there are redundant power supplies on warp, this A/C Input module is not redundant.


Ron spent some time trying to get DJ connected to drives on his PC.


Ron began trying to source the part for warp. 9:15 am: contacted Sun, sent email to SNAG, trolled eBay.


Mary Ann put a link on our Webserver Homepage advising of situation with warp.


10:12am Ron sends email to everyone list advising of failure:

“physics (our main login server, application server and print server) has had a major hardware failure and is out of service. All other servers (eg the mail and file servers) are functioning normally. I am in the process of getting replacement parts but it will probably take a few days to get them.
In the meantime, I will be setting up another computer that will function as a print server and general login server ( ssh / pine ).

If you normally read your email using pine on physics, you will have to do one of the following:
1. Install pine on your local computer (windows/linux versions of pine are available).
2. Install almost any other email client package on your computer - we recommend thunderbird (www.mozilla.com).
3. Use webmail.

Instructions are on our web pages for configuring various email clients.

You should be able to connect to your home directories normally via samba shares.
If you need access to special directories, or have any questions or concerns, please contact one of the computer systems staff.”


PM: Ron had alternate server in place that people could ssh into.


Gerry and Ron discussed printing situation (still not working) and fact that Ron had a scheduled vacation day planned for Wed Jan 4th. Gerry and Ron agreed printing could wait until Thurs AM when Ron returned to work.

7

Wed Jan 4th

Ron had a vacation day.

8

Thu Jan 5th

Ron continued search for part for warp. Finally located one from supplier in Brampton. Had him ship it as fast as possible to us.

9

Fri Jan 6th

Part arrives. Tom phones Ron and he comes in to work and installs part. System comes up normally.



Particular Problems encountered:


#

Problem

Discussion / Resolution

1

No alternate remote login server (ssh access).


- Affected users who normally log in remotely to warp to read their email (pine).


- Affected users who normally log in remotely to warp to get access to their home, mail, and web directories.

People in Hennings could log in to “lts1” (Linux server) from the X-terminals in Henn 205 and use pine or connect to their directories on the file server (alpha).


Remote users could connect to their HomeDirs (and any other shares that had been set up for them) if they connected using the PhysicsVPN.


Remote users had no machine they could ssh to.


We already had (and still have) plans in place for an alternate login server (physics replacement) but it hasn't yet been purchased. Priority on this will be escalated.


We have implemented a temporary alternate ssh login server (deneb).

2

Printing


- Our print server was down so no one could print.

We use LPRng / ifhp software to manage print queues and do printer accounting. Installing this software on the replacement Linux server went smoothly, except for a bug in the ifhp code that prevented us from communicating with the printers properly to ascertain EOJ (end of job) or do printer accounting. By late Tues I still didn't have this working and, after discussing it with Gerry, we decided it could wait until Thurs morning.


Thursday morning, I reconfigured the code to communicate with the printers using (very slow) PostScript code. This allowed printing to proceed but without accounting.


We are planning to switch to CUPS / PyKota software (much more modern and the default on most Linux servers) and will set up a failover print server.
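Both resolutions above (an alternate ssh login server and a failover print server) rest on the same idea: try the primary host first and fall back to the next one that answers. The following is only a rough sketch of that idea, not a description of our actual setup. The function names are hypothetical, and the host list and port numbers in the usage note (22 for ssh, 631 for IPP as used by CUPS) are assumptions chosen for illustration.

```python
import socket
from typing import Callable, Iterable, Optional


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def first_reachable(
    hosts: Iterable[str],
    port: int,
    probe: Callable[[str, int], bool] = port_is_open,
) -> Optional[str]:
    """Return the first host, in priority order, that answers on port.

    Returns None if no host answers. The probe is injectable so the
    selection logic can be exercised without touching the network.
    """
    for host in hosts:
        if probe(host, port):
            return host
    return None
```

For example, `first_reachable(["warp", "deneb"], 22)` would return "warp" while it accepts ssh connections and "deneb" otherwise. The short hostnames are taken from this report; the fully qualified names that would be needed in practice are not given here. A periodic check of this kind could also flag an outage sooner than waiting for a user to report it.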