[BUGS] SCSI hardware failure?

Tong Wang tong at hsa.com.au
Mon Jan 14 14:43:40 EST 2008


Thanks for the replies, Callum and Jonathan.

Sorry for being unclear about the overall picture. OK, here is more info:

OS - FreeBSD 5.4-STABLE

Server - IBM eServer346 - M8840 (Attached is the dmesg from system boot)
 
Callum Gibson wrote:
> On 14Jan08 11:21, Tong Wang wrote:
> }Copied 32 bytes of sense data offset 12: 0x70 0x0 0x1 0x0 0x0 0x0 0x0 0x18 
> }0x0 0x0 0x0 0x0 0x5d 0x2 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0xff 0xff 
> }0xff 0xff 0xff 0xff 0x0 0x0
> }(da0:ahd1:0:0:0): READ(10). CDB: 28 0 1 f5 4 3b 0 0 4 0
> }(da0:ahd1:0:0:0): CAM Status: SCSI Status Error
> }(da0:ahd1:0:0:0): SCSI Status: Check Condition
> }(da0:ahd1:0:0:0): RECOVERED ERROR asc:5d,2
> }(da0:ahd1:0:0:0): Reserved ASC/ASCQ pair
> }(da0:ahd1:0:0:0): No Recovery Action Needed
> }
> }I've had this for quite a while and it's repeating, since it says "No 
> }Recovery Action Needed", nothing much has been done. Until recently, I 
>
> SCSI drives have their own bad sector management so you really do need
> to do nothing. However if you are seeing this message frequently then
> it's an indication your disk is dying and you will need to replace it
> before it goes completely and has unrecoverable errors.
>
>   
Yep, I see this message at least 100 times a day, so I think it's 
time for a disk change.
> }found that AMANDA is not able to send daily check emails from amcheck, 
> }amverify etc, so I checked the server again, using mail to trying to send 
> }an email from the AMANDA account, and got the following error:
> }
> }mail: /tmp/mail.Rsi6mQUtTJ52: Permission denied
> }
> }more to it, upon finishing, amverify also has this permission denied error.
> }I also noticed that i can't run finger on any account, same permission 
> }denied error happens, even using root account.
> }
> }To sum it up, this server is having all sorts of strange behaviors which 
> }I'm guessing are from the SCSI error, but I'm not sure what exactly is 
> }going on with it. I've worked with FreeBSD for not too long, so some help 
> }would be greatly greatly appreciated! :)
>
> It's unlikely to be caused by your SCSI issue unless you had errors which
> weren't recovered. I'm not sure that's even possible as the hardware should
> cause some sort of more major error which the kernel will report, rather
> than silently corrupting your data (such as file permissions, etc).
>
> You'll need to provide some more info on the details of your installation
> for other ideas. What release of FreeBSD? Are you using NIS? How long
> has the machine been up? Has this all been working perfectly before and
> for how long? What other changes have you made recently? etc. 
>   
The server is a backup file server and NIS server, and it also runs 
the AMANDA backup software (version 2.4.5). It has 6 SCSI disks, striped 
and mirrored, and 2 SCSI backup tape drives. It runs 2 daily tape backups 
using AMANDA, and runs amcheck every hour from cron to check that the 
correct tapes are in the drives. After a successful backup, AMANDA normally 
sends out an email with the summary report for each backup configuration.

The server has been up for nearly 2 years, and everything was running 
with no problem until last Friday, when AMANDA stopped sending emails to 
me all of a sudden. I tested it by typing:

su amanda
mail -s test -F tong at hsa.com.au

and here is the result:

mail: /tmp/mail.RspIDEqoLfkF: Permission denied

amverify, which is provided by AMANDA, normally sends a summary report 
to my email address when it finishes. Now, in addition to the normal 
output, it shows a similar "Permission denied" error at the end of the 
summary it prints to stdout. Sending mail from the root account still 
works, but I get the same "Permission denied" error after su'ing to 
other user accounts from NIS.
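
Since root can still send mail while ordinary accounts cannot create 
their /tmp/mail.* file, one thing that could be ruled out is the mode 
of /tmp itself. This is only a guess, nothing here confirms it. The 
sketch below is Python rather than anything installed on this server, 
and the function names are made up for illustration. It assumes the 
conventional /tmp mode of 1777 (rwx for everyone plus the sticky bit) 
and reports what is missing.

```python
import os
import stat


def tmp_mode_problems(mode):
    """Return a list of reasons why `mode` is not a usual /tmp mode.

    `mode` is an st_mode value as returned by os.stat(). The expected
    mode (directory, 1777) is an assumption about how /tmp is normally
    set up, not something taken from this server.
    """
    problems = []
    if not stat.S_ISDIR(mode):
        problems.append("not a directory")
    perms = stat.S_IMODE(mode)  # keep only the permission bits
    if perms & 0o777 != 0o777:
        problems.append("not rwx for user, group and other")
    if not perms & stat.S_ISVTX:
        problems.append("sticky bit missing")
    return problems


def check_path(path="/tmp"):
    """Stat `path` and return the problems found (empty list if none)."""
    return tmp_mode_problems(os.stat(path).st_mode)


if __name__ == "__main__":
    found = check_path("/tmp")
    print("/tmp looks normal" if not found else "; ".join(found))
```

An empty result would only mean the directory mode is not the cause, 
so the NIS accounts would still need to be looked at separately.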

I am really trying to get the AMANDA emails back. The reason I relate 
the disk error to this is that I tried to reboot the server, and the 
first attempt failed with:

Missing Operating System

On the second try, it halted halfway, with the following message:

Drive on AIC-7902 B at slot 00, 08:07:01, SCSI ID: 0 has exceeded 
failure prediction threshold.

This is why I relate these two issues together, maybe I'm wrong though.
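
For what it's worth, the BIOS message can be compared with the sense 
bytes quoted at the top of this mail. The sketch below is in Python and 
assumes the dump is standard fixed-format SCSI sense data (response 
code 0x70 or 0x71), with the sense key in the low nibble of byte 2 and 
the ASC/ASCQ in bytes 12 and 13. The function name and the small lookup 
table are mine, for illustration only.

```python
# Sense bytes copied from the kernel message quoted earlier in this mail.
SENSE_DUMP = """
0x70 0x0 0x1 0x0 0x0 0x0 0x0 0x18
0x0 0x0 0x0 0x0 0x5d 0x2 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0xff 0xff
0xff 0xff 0xff 0xff 0x0 0x0
"""

# Only a handful of sense keys, enough for this example.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
}


def decode_fixed_sense(dump):
    """Decode the main fields of fixed-format sense data.

    `dump` is whitespace-separated hex bytes such as "0x70 0x0 0x1".
    Raises ValueError if the data is too short or not fixed format.
    """
    data = bytes(int(token, 16) for token in dump.split())
    if len(data) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    response_code = data[0] & 0x7F  # bit 7 is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    sense_key = data[2] & 0x0F
    return {
        "response_code": response_code,
        "sense_key": sense_key,
        "sense_key_name": SENSE_KEYS.get(sense_key, "other"),
        "additional_length": data[7],
        "asc": data[12],
        "ascq": data[13],
    }


if __name__ == "__main__":
    for field, value in decode_fixed_sense(SENSE_DUMP).items():
        print(field, value if isinstance(value, str) else hex(value))
```

The decoded ASC is 0x5d, which in the SCSI specification is the 
"failure prediction threshold exceeded" group of codes. That would match 
the BIOS message above. If so, the kernel printing "Reserved ASC/ASCQ 
pair" may just mean its table does not list this ASCQ, but that part is 
my guess.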

df gives the following:

Filesystem             1K-blocks      Used     Avail Capacity  Mounted on
/dev/mirror/gm0s1a        495726    107454    348614    24%    /
devfs                          1         1         0   100%    /dev
/dev/mirror/gm0s1d       2026030     78516   1785432     4%    /var
/dev/mirror/gm0s1e       2026030    997190    866758    53%    /tmp
/dev/mirror/gm0s1f      10154158   1415242   7926584    15%    /usr
/dev/stripe/stripe0s1a 800991544 576215642 216765988    73%    /export

and camcontrol devlist gives the following:

<HP Ultrium 2-SCSI S24D>           at scbus0 target 2 lun 0 (sa0,pass0)
<HP Ultrium 2-SCSI S24D>           at scbus0 target 3 lun 0 (sa1,pass1)
<IBM-ESXS GNS300C3ESTT0ZFN JP85>   at scbus1 target 0 lun 0 (pass2,da0)
<IBM-ESXS GNS300C3ESTT0ZFN JP85>   at scbus1 target 1 lun 0 (pass3,da1)
<IBM-ESXS PYH300C3-ETS10FN RXQE>   at scbus1 target 2 lun 0 (pass4,da2)
<IBM-ESXS GNS300C3ESTT0ZFN JP85>   at scbus1 target 3 lun 0 (pass5,da3)
<IBM-ESXS GNS300C3ESTT0ZFN JP85>   at scbus1 target 4 lun 0 (pass6,da4)
<IBM-ESXS GNS300C3ESTT0ZFN JP85>   at scbus1 target 5 lun 0 (pass7,da5)
<IBM 25R5170a S320  0 1>           at scbus1 target 8 lun 0 (ses0,pass8)
<HL-DT-ST DVD-ROM GDR8082N 0L03>   at scbus2 target 0 lun 0 (cd0,pass9)

The only change I made recently was to group the two amdump commands in 
/etc/crontab into one script and run that script from crontab. AMANDA 
did complete one successful backup of both configurations this way 
before the email problem occurred. The script only has 3 lines:

#!/usr/bin/bash

/usr/local/sbin/amdump DailyConf1
/usr/local/sbin/amdump DailyConf2

>     C
>
>   
Well, I hope the info above is a bit more useful than what I provided 
previously. Thanks again for taking a look.

Bestest Regards

Tong



-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dmesg.txt
Url: http://mailman.barnet.com.au/pipermail/bugs/attachments/20080114/249aec99/attachment-0001.txt 

