[Novalug] Monitoring SMART drive attributes via SNMP?

Jeremy Arthur jarthur77@hotmail.com
Fri Jun 5 21:15:45 EDT 2015


I may have discovered the specific SMART attribute that is my smoking gun:

High Load_Cycle_Count (193)

The drive's Advanced Power Management feature is causing the drive's heads to
park after a short period of inactivity (5-10 seconds). This can severely limit
the lifetime of the drive.  Use the hdparm -B option to adjust this timeout
behavior.
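A quick way to watch this attribute (a sketch; /dev/sdb is an example device, and the sample line below is illustrative of smartctl -A output):

```shell
# Check the running total of head load/unload cycles (attribute 193).
# Run as root against your own drive:
#   smartctl -A /dev/sdb | grep Load_Cycle_Count
# The raw value is the last field of that line; extracting it with awk,
# demonstrated here on a captured sample line:
line='193 Load_Cycle_Count   0x0032   001   001   000    Old_age   Always       -       876462'
count=$(printf '%s\n' "$line" | awk '{ print $NF }')
echo "current load cycles: $count"
```

Re-checking the count an hour apart tells you how aggressively the heads are parking.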

If you have used desktop/laptop drives in a RAID/NAS setup like I have, because:
a) You are a cheapskate
b) You didn't know any better at the time of purchase
c) All of the above

...you may want to take note.
======================================================
# Does your drive support Advanced Power Management (APM)?
hdparm -i /dev/sdb
/dev/sdb:
...
 AdvancedPM=yes WriteCache=enabled
...

# Get the current setting
hdparm -B /dev/sdb

/dev/sdb:
 APM_level    = 128

*128 seems to have been the default in my case

According to man hdparm
...
       -B     Get/set Advanced Power Management feature, if the drive supports
              it.  A  low  value  means aggressive power management and a high
              value means better performance.  Possible  settings  range  from
              values  1  through  127 (which permit spin-down), and values 128
              through 254 (which do not permit spin-down).  The highest degree
              of  power  management  is  attained with a setting of 1, and the
              highest I/O performance with a setting of 254.  A value  of  255
              tells  hdparm to disable Advanced Power Management altogether on
              the drive (not all drives support disabling it, but most do).
...
======================================================
According to
http://www.freeminded.org/index.php/2012/04/hdparm-power-down-drives-when-not-in-use/ 

-B (APM level):
1-127      permit spindown
128-254    do not permit spindown
255        disables APM  *seems to be a popular option amongst NAS owners

Timeout encoding (note: man hdparm documents this value-to-time mapping under the separate -S standby flag; -B values set a power-management level, not a timeout):
1-240      5 sec intervals   (1 = 5 sec, 127 = 635 sec, 240 = 1200 sec)
241-254    30 minute intervals   (241 = 30 min, 254 = 420 min)

hdparm attributes set via the command line do not persist across reboots
unless you set them in /etc/hdparm.conf (specific to Debian systems).

In my experience on CentOS 6.x, however, these hdparm settings do persist across reboots.

I have since set my default to be 241 = 30 minutes
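For reference, the commands look like this (a sketch; /dev/sdb is an example device, and note that -S is hdparm's separate standby-timeout flag):

```shell
hdparm -B 241 /dev/sdb   # APM level in the 128-254 band: no APM-driven spin-down
hdparm -S 241 /dev/sdb   # standby timeout: 241 encodes 30 minutes
```

On systems without /etc/hdparm.conf, re-running these at boot (e.g. from /etc/rc.local or a udev rule) is a common way to make them stick.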
======================================================
You will want to set your smartd configuration so that it does not spin up drives that are in standby/idle:
in /etc/smartd.conf, add the '-n standby' or '-n idle' option
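For example, a per-device smartd.conf line might look like this (a sketch; the device name and mail target are illustrative, and the trailing ,q quiets the "skipping check" log messages):

```shell
# /etc/smartd.conf
/dev/sdb -a -n standby,q -m root
```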

======================================================
One last thing: if you do use the Cacti files from the https://www.pitt-pladdy.com link I referenced below,
this Load_Cycle_Count (193) seems to be the only attribute that he doesn't account for by default.

To add it
1) Add the extend line in snmpd.conf
...
extend smart193 /etc/snmp/smart-generic 193
...

To test:
snmpwalk -v2c -cpublic 127.0.0.1 .1.3.6.1.4.1.8072.1.3.2   
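If you're curious what such a helper boils down to, here is a hypothetical sketch (the real smart-generic script from the pitt-pladdy.com setup differs; it reads the cron-dumped smartctl data rather than a temp file):

```shell
# Print one SMART attribute's raw value from saved `smartctl -A` output,
# so snmpd's "extend" directive can expose it.
smart_attr() {
    # $1 = attribute ID, $2 = file holding smartctl -A output
    awk -v id="$1" '$1 == id { print $NF }' "$2"
}

# Demo against a captured dump line:
dump=$(mktemp)
printf '%s\n' '193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 876462' > "$dump"
smart_attr 193 "$dump"
rm -f "$dump"
```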

2) Add the query entry to cacti/resource/snmp_queries/disk_smart.xml
<smart193>
      <name>193 Load Cycle Count</name>
      <method>walk</method>
      <source>value</source>
      <direction>output</direction>
      <oid>NET-SNMP-EXTEND-MIB::nsExtendOutLine."smart193"</oid>
</smart193>

3) Add the line/gprint entries into the Cacti graph template

4) Create the new graphs

======================================================
TL;DR

You shouldn't use desktop/laptop-grade drives in a NAS/RAID setup, but if you must (or have inherited such a setup), look at the various hdparm and smartctl/smartd settings to help mitigate your failures.

http://www.extremetech.com/computing/203478-backblaze-pulls-3tb-seagate-ssds-from-service-details-post-mortem-failure-rates

https://www.backblaze.com/blog/3tb-hard-drive-failure/

~J


> To: novalug@firemountain.net
> Date: Wed, 4 Mar 2015 18:41:56 -0500
> Subject: Re: [Novalug] Monitoring SMART drive attributes via SNMP?
> From: novalug@firemountain.net
> 
> Thanks Peter
> 
> I do have email alerts set up, but in my limited experience, by the time smartd has sent the email, the drive has already failed.  The idea behind the graphing is to spot a trend in one of the SMART attributes, predict the failure, and replace the drive before any data is potentially lost. More proactive than reactive.
> 
> I could setup smartd/logwatch to watch certain attributes, but I think I would get desensitized to the email logs rather quickly
> 
> So you would recommend configuring smartd to do periodic self-tests?  Any recommendations on frequency for short/long? Weekly short/monthly long?
> 
> v/r
> -Jeremy
> 
> > Date: Wed, 4 Mar 2015 16:59:07 -0500
> > To: novalug@firemountain.net
> > Subject: Re: [Novalug] Monitoring SMART drive attributes via SNMP?
> > From: novalug@firemountain.net
> > 
> > Not sure what you're trying to do. If you just want to get notifications
> > of a potential HDD issue, what's wrong with a simple alert email from
> > smartd ?? That's the default, you simply need to add your own email
> > address instead of root, if you don't monitor the root inbox.
> > 
> > Take a look at /etc/smartmontools/* and /etc/sysconfig/smartmontools to
> > configure the startup options.
> > For severe errors you should add features to rsyslog to send you severe
> > messages either by email or to a location you look all the time. In the
> > olden days I had those messages go to my mobile provider's "email to SMS"
> > feature, so my phone would go off with the failure message from the
> > infrastructure I was managing.
> > 
> > I don't see much point in putting up a web-site with pretty graphs if
> > you're not looking at them every day to get information. And if you're
> > just looking to have it alert you by emails, that seems to be going the
> > long way around?
> > 
> > Btw. you should run regular smart tests on your drives. The parameters
> > to smartd can be set to manage a schedule for you too.
> > While smartd isn't perfect - I cannot think of anything that is? Unless
> > you buy A LOT of hard drives and can compare production batches, the
> > next best thing for failure predictions is the SMART firmware. And yes
> > it can report false positives and false negatives. It also can report
> > failures long before it happens.  So far, I've never had a problem
> > returning a drive when I enclosed SMART stats showing it was failing.
> > 
> > -- 
> > Regards
> >   Peter Larsen
> > 
> > 
> > 
> > 
> > On 03/04/2015 03:27 PM, Jeremy Arthur via Novalug wrote:
> > > Thoughts anyone?
> > >
> > >
> > >> To: novalug@firemountain.net
> > >> Date: Fri, 27 Feb 2015 20:34:47 -0500
> > >> Subject: [Novalug] Monitoring SMART drive attributes via SNMP?
> > >> From: novalug@firemountain.net
> > >>
> > >> All,
> > >>
> > >> I've lost my fair share of hard drives over the years.  Sometimes they make clicky sounds, sometimes they fill /var/log/messages with obscure hex error codes, sometimes you come home to an unresponsive box, or sometimes, just sometimes, they are nice enough to cough up a SMART error 2 days after the warranty expires. FML FWP
> > >>
> > >> This has caused me to become ever more vigilant in my quest to protect my data.  I've tried various backup methodologies, raid configurations and email alerts.  I am currently using ZFS in concert with smartd email alerts.  Over the holidays, I lost yet another raid drive.  smartd sent me the email alert like it was supposed to, and the raid went into a degraded state, but the drive was already dead at this point.
> > >>
> > >> What good is an alert if the drive is already dead?  Once I got the alert email, it seemed like a case of too little too late.  I would like a little advanced notice so I can maybe replace the drive or move the data to another drive/server. So I currently monitor SMART attributes via a Cacti setup as per this link
> > >>
> > >> https://www.pitt-pladdy.com/blog/_20091031-144604_0000_SMART_stats_on_Cacti_via_SNMP_/
> > >>
> > >> In the hopes I can see a trend in one of the attributes and replace a drive before it fails.  To summarize the link
> > >>
> > >> 1. Create cron job to dump smartctl data to text file
> > >> 2. Use the custom perl script to parse the smartctl data
> > >> 3. Add an extend rule to /etc/snmp/snmpd.conf for each attribute (~60 rules)
> > >> 4. Import the Cacti templates and graph yourself silly
> > >>
> > >> TL;DR
> > >>
> > >> Here's what I would like to know from you guys:
> > >> 1) What do you use to monitor hard drives for impending failure? Nagios perhaps?
> > >> 2) Most Cacti graphs use default SNMP data (don't need steps 1-3).  Is there a SMART SNMP MIB somewhere?
> > >> 3) Any other advice?
> > >>
> > >> -J
> > >>
> > >>
> > >> **********************************************************************
> > >> The Novalug mailing list is hosted by firemountain.net.
> > >>
> > >> To unsubscribe or change delivery options:
> > >> http://www.firemountain.net/mailman/listinfo/novalug
> > 
