Sequence of Events – WARP Failure, Dec 30th, 2005

| # | Date & Time | Description |
|---|---|---|
| 1 | Fri Dec 30th, time? | Warp A/C power module dies (only piece on warp that is not redundant). |
| 2 | Sun Jan 1st, 6:21 pm | Gerry became aware of the problem and sent email to itstaff. |
| 3 | Sun Jan 1st, 10:57 pm | David Jones sends email: |
| 4 | Mon Jan 2nd, 9:07 am | David Jones sends email: “ping to phas fails...” |
| 5 | Mon Jan 2nd, ~10 am | Ron emailed DJ: “physics has died. If you are in the Hennings building, you can go to room 205 and log in to LTS1 using the X-terminals on the left-hand side of the room. That will give you access to the files you need. You could also get access to your ~public_html files via samba. I could also set up a PHYS533 samba share if you need it.” |
| 6 | Tues Jan 3rd | AM: Ron came in, diagnosed the problem, and tried a couple of fixes. Determined that the A/C input power module had died. All other modules appeared OK. Although there are redundant power supplies on warp, this A/C input module is not redundant.<br>Ron spent some time trying to get DJ connected to drives on his PC.<br>Ron began trying to source a part for warp. 9:15 am: contacted Sun, sent email to SNAG, trolled eBay.<br>Mary Ann put a link on our webserver homepage advising of the situation with warp.<br>10:12 am: Ron sends email to the everyone list advising of the failure: “physics (our main login server, application server and print server) has had a major hardware failure and is out of service. All other servers (eg the mail and file servers) are functioning normally. I am in the process of getting replacement parts but it will probably take a few days to get them.”<br>PM: Ron had an alternate server in place that people could ssh into.<br>Gerry and Ron discussed the printing situation (still not working) and the fact that Ron had a scheduled vacation day on Wed Jan 4th. Gerry and Ron agreed printing could wait until Thurs AM, when Ron returned to work. |
| 7 | Wed Jan 4th | Ron had a vacation day. |
| 8 | Thu Jan 5th | Ron continued the search for a part for warp. Finally located one from a supplier in Brampton and had him ship it to us as fast as possible. |
| 9 | Fri Jan 6th | Part arrives. Tom phones Ron, who comes in to work and installs the part. System comes up normally. |

Particular Problems encountered:

| # | Problem | Discussion / Resolution |
|---|---|---|
| 1 | No alternate remote login server (ssh access).<br>- Affected users who normally log in remotely to warp to read their email (pine).<br>- Affected users who normally log in remotely to warp to get access to their home, mail, and web directories. | People in Hennings could log in to “lts1” (Linux server) from the X-terminals in Henn 205 and use pine or connect to their directories on the file server (alpha).<br>Remote users could connect to their HomeDirs (and any other shares that had been set up for them) if they connected using the PhysicsVPN.<br>Remote users had no machine they could ssh to.<br>We already had (and still have) plans in place for an alternate login server (physics replacement), but it hasn't yet been purchased. Priority on this will be escalated.<br>We have implemented a temporary alternate ssh login server (deneb). |
| 2 | Printing<br>- Our print server was down, so no one could print. | We use LPRng / ifhp software to manage print queues and do printer accounting. Installing this software on the replacement Linux server went smoothly, except for a bug in the ifhp code which meant we couldn't communicate with the printers properly to ascertain EOJ (end of job) or do printer accounting. Late on Tues I didn't have this working and, after discussing it with Gerry, we decided it could wait until Thurs morning.<br>Thursday morning, I reconfigured the code to communicate with the printers using (very slow) PostScript code. This allowed printing to proceed, but without accounting.<br>We are planning to switch to CUPS / PyKota software (much more modern and the default on most Linux servers) and will set up a failover print server. |
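
Both resolutions above depend on having a second machine to fall back to: an alternate ssh login server and a failover print server. As a rough illustration only, the sketch below shows one way a client-side script might pick the first server that answers on a given TCP port. The function name `first_reachable`, the default port, the timeout, and the idea of probing hosts in order are all assumptions made for this example; they do not describe how the failover was or will be implemented.

```python
import socket
from typing import Callable, Iterable, Optional


def first_reachable(
    hosts: Iterable[str],
    port: int = 22,            # assumed: ssh; a print server would use another port
    timeout: float = 3.0,      # assumed: seconds to wait per host
    connect: Callable = socket.create_connection,
) -> Optional[str]:
    """Return the first host that accepts a TCP connection on `port`, else None."""
    for host in hosts:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Host is down, unreachable, unresolvable, or the port is closed:
            # move on to the next candidate.
            continue
        conn.close()
        return host
    return None
```

Called as `first_reachable(["physics", "deneb"])`, this would try the main login server first and fall back to the temporary alternate named in the table. The `connect` parameter exists only so the probe can be replaced with a stub when testing without a network.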