2012-06-05 Mail problems

Some time after 0800, reports of errors with Webmail were received. Hungyon arrived on site at about 0830 & confirmed that Webmail was failing & reporting an error. She initiated a reboot at 0847. However, the reboot triggered an fsck of the disks, so the system was not up & running until 1000.

The system appears to be running normally, but there were numerous errors in the messages log concerning SCSI disks (see below), both before & during the reboot. The kernel's trouble seems to have started at 0316, which more or less coincides with events on the Storage Array at 0313.

Storage Controller A has an amber LED lit on its front panel. The SMClient reported that service was needed, but following the details, it was referring to the battery nearing expiration. The batteries in both controllers are due to expire on 2012-06-06+7d. Should we order new ones? The batteries protect the write-cached data in the event of power loss. However, perhaps more serious is the info in the "Events Log" concerning a failed SAS port. NB I cannot find a "PHY LOG".

mail:/var/log/messages (excerpts) SCSI errors from 3 AM.

Jun  5 03:16:52 mail kernel: sd 2:0:0:1: SCSI error: return code = 0x00020000
Jun  5 03:16:52 mail kernel: end_request: I/O error, dev sdf, sector 845704641
Jun  5 03:16:52 mail kernel: printk: 20 messages suppressed.
Jun  5 03:16:52 mail kernel: Buffer I/O error on device dm-2, logical block 211426065
Jun  5 03:16:52 mail kernel: lost page write due to I/O error on dm-2
Jun  5 03:16:52 mail kernel: Aborting journal on device dm-2.
Jun  5 03:16:52 mail kernel: sd 2:0:0:1: SCSI error: return code = 0x00020000
Jun  5 03:16:52 mail kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jun  5 03:16:52 mail kernel: end_request: I/O error, dev sdf, sector 845730313
Jun  5 03:16:52 mail kernel: Buffer I/O error on device dm-2, logical block 211432483
Jun  5 03:16:52 mail kernel: lost page write due to I/O error on dm-2
Jun  5 03:16:52 mail kernel: sd 2:0:0:1: SCSI error: return code = 0x00020000
Jun  5 03:16:52 mail kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
Jun  5 03:16:52 mail kernel: end_request: I/O error, dev sdf, sector 845760193
Jun  5 03:16:52 mail kernel: Buffer I/O error on device dm-2, logical block 211439953
Jun  5 03:16:52 mail kernel: lost page write due to I/O error on dm-2
Jun  5 03:16:52 mail kernel: sd 2:0:0:1: SCSI error: return code = 0x00020000
Jun  5 03:16:52 mail kernel: end_request: I/O error, dev sdf, sector 497
Jun  5 03:16:52 mail kernel: mptbase: ioc1: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
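To see how widespread the errors are without reading the whole log, something like the following could be used to count the end_request I/O errors per device and report the first timestamp. This is only a sketch: the function name summarize_io_errors and the regular expression are illustrative, and the pattern assumes the line layout shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jun  5 03:16:52 mail kernel: end_request: I/O error, dev sdf, sector 845704641
# (layout assumed from the excerpt above; other kernels may format this differently)
IO_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

def summarize_io_errors(lines):
    """Return (timestamp of first I/O error, Counter of I/O errors per device)."""
    first = None
    per_dev = Counter()
    for line in lines:
        m = IO_ERROR.match(line)
        if not m:
            continue  # not an end_request I/O error line
        if first is None:
            first = m.group("stamp")
        per_dev[m.group("dev")] += 1
    return first, per_dev

# Three lines copied from the excerpt above, used as sample input.
sample = [
    "Jun  5 03:16:52 mail kernel: sd 2:0:0:1: SCSI error: return code = 0x00020000",
    "Jun  5 03:16:52 mail kernel: end_request: I/O error, dev sdf, sector 845704641",
    "Jun  5 03:16:52 mail kernel: end_request: I/O error, dev sdf, sector 497",
]
first, per_dev = summarize_io_errors(sample)
print("first I/O error:", first)
for dev, count in per_dev.most_common():
    print(dev, count)
```

To run it against the real log, pass an open file handle for /var/log/messages to summarize_io_errors instead of the sample list.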

From SMClient events log

Date/Time: 6/5/12 3:13:50 AM
Sequence number: 7547
Event type: 1707
Description: Degraded wide port becomes failed
Event specific codes: 0/0/0
Event category: Error
Component type: Enclosure Component (ESM, GBIC/SFP, Power Supply, or Fan)
Component location: Enclosure 85, Slot 2
Logged by: Controller in slot A
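
As a quick check on how closely this event lines up with the kernel errors, the gap between the two timestamps can be worked out as below. This is a sketch only: it assumes the storage array clock and the mail host clock are synchronised and in the same time zone, which has not been verified, and it uses the earliest kernel error in the excerpt above rather than the whole log.

```python
from datetime import datetime

# Timestamp of the SMClient event above (month/day/2-digit year, 12-hour clock).
array_event = datetime.strptime("6/5/12 3:13:50 AM", "%m/%d/%y %I:%M:%S %p")

# Earliest kernel error in the messages excerpt. syslog omits the year,
# so 2012 is supplied explicitly (assumption taken from the date of this note).
first_kernel_error = datetime.strptime("2012 Jun  5 03:16:52", "%Y %b %d %H:%M:%S")

gap = first_kernel_error - array_event
print(gap.total_seconds())  # 182.0
```

Under those assumptions the kernel errors begin about three minutes after the array logged the failed wide port.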

From SMClient Help pages
Event: 1707
Description: Degraded wide port becomes Failed
Action: A SAS port has been marked as failed. A degraded port is usually caused by a faulty cable, environmental services monitor (ESM), controller, disk drive, or enclosure connector. Analysis of the PHY Error Logs might help isolate the problem and indicate which component must be replaced.

From the Recovery Guru...

Storage Subsystem: PHAS_MailStore 
Component reporting problem: Battery 
Status: Near expiration 
Location:  Controller enclosure 85, Controller in Slot A 
Smart battery: Yes 
Component requiring service: Controller A 
Service action (removal) allowed: No   
Service action LED on component: No