Monitoring S.M.A.R.T. devices

During the server disaster I mentioned earlier, I decided to monitor the drives a bit more closely. The drives in the system were SMART-capable, so I emerge’d sys-apps/smartmontools. Here is the description of that package:


sys-apps/smartmontools
Available versions: 5.36-r1 5.37 ~5.37-r1 5.38 {minimal static}
Installed versions: 5.38(10:56:28 10/09/08)(-minimal -static)
Homepage: http://smartmontools.sourceforge.net/
Description: control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (S.M.A.R.T.)

Using smartctl, you can quickly test and analyze the health of your drives. You can also enable the smartd daemon to monitor the drives and alert you to any changes or issues. Simply add the daemon to the default runlevel:


rc-update add smartd default
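
For a quick manual look at a drive alongside the daemon, something like the following sketch may help. It only uses standard smartctl options (-i, -H, -A and -l error). The run and check_drive helpers and the DRY_RUN variable are names I made up for illustration, and /dev/hda is simply one of the devices on this box.

```shell
#!/bin/sh
# Sketch: quick SMART check of one drive using standard smartctl options.
# run(), check_drive() and DRY_RUN are illustrative helpers, not part of smartmontools.

run() {
    # Print the command instead of executing it when DRY_RUN=1.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

check_drive() {
    dev=$1
    run smartctl -i "$dev"         # identity info, and whether SMART is supported and enabled
    run smartctl -H "$dev"         # overall health self-assessment
    run smartctl -A "$dev"         # vendor attribute table (Raw_Read_Error_Rate and friends)
    run smartctl -l error "$dev"   # the error log kept by the drive itself
}

# Example (needs root): check_drive /dev/hda
# Preview only:         DRY_RUN=1; check_drive /dev/hda
```

All four commands are read-only, so this should be safe to run on a live system.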

I started seeing some interesting bits in the logs. Here is a sample:


Nov 25 02:17:49 comp smartd[5381]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 96 to 97
Nov 25 02:17:49 comp smartd[5381]: Device: /dev/hdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 62 to 60
Nov 25 02:47:50 comp smartd[5381]: Device: /dev/hdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 60 to 62
Nov 26 16:47:50 comp smartd[5381]: Device: /dev/hdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 62 to 63
Nov 26 18:17:49 comp smartd[5381]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 94 to 95
Nov 26 18:17:49 comp smartd[5381]: Device: /dev/hdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 63 to 62
Nov 26 18:47:50 comp smartd[5381]: Device: /dev/hdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 62 to 61
Nov 26 19:17:49 comp smartd[5381]: Device: /dev/hdd, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 61 to 62
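
To see at a glance how much each attribute is bouncing around, the messages can be summarized per device. This is a rough sketch that assumes lines in exactly the format above (single-word hostname, space-separated fields). The function name summarize_smartd_changes is my own invention, and the log path in the example comment is a guess that depends on your syslog setup.

```shell
#!/bin/sh
# Sketch: summarize smartd "changed from X to Y" messages per device and attribute.
# Reads syslog-style lines on stdin and prints one summary line per device/attribute.

summarize_smartd_changes() {
    awk '
        / smartd\[[0-9]+\]: Device: .* changed from [0-9]+ to [0-9]+/ {
            dev = $7
            sub(/,$/, "", dev)              # strip the trailing comma from the device field
            key = dev " " $12               # device plus attribute name
            old = $15 + 0
            new = $17 + 0
            count[key]++
            if (!(key in lo)) { lo[key] = old; hi[key] = old }
            if (old < lo[key]) lo[key] = old
            if (new < lo[key]) lo[key] = new
            if (old > hi[key]) hi[key] = old
            if (new > hi[key]) hi[key] = new
            last[key] = new                 # most recent value reported
        }
        END {
            for (key in count)
                printf "%s changes=%d min=%d max=%d last=%d\n",
                       key, count[key], lo[key], hi[key], last[key]
        }
    ' | sort
}

# Example (log path is an assumption): summarize_smartd_changes < /var/log/messages
```

Fed the eight sample lines above, this reports 2 changes for /dev/hda (values between 94 and 97) and 6 changes for /dev/hdd (values between 60 and 63), which makes the back-and-forth wobble easier to spot.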

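After Googling around and talking to the server farm's support staff, I learned that these messages are “normal” drive chatter. I found a post that mentioned this and suggested adding a -l flag to /etc/smartd.conf. As far as I can tell, once a DEVICESCAN line carries explicit directives, smartd no longer applies its default -a (which includes attribute tracking), so with only -l error it watches the drive's error log and stops reporting every attribute change. I added the following to my DEVICESCAN line and restarted the service.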


DEVICESCAN -l error

I’ll be looking into monitoring my SMART devices with Cacti next, to establish some history and spot patterns. I’ll be sure to document that process once I get it done.