Re: [Hampshire] [OT] MTBF

Author: Stephen Rowles
Date:
To: Hampshire LUG Discussion List
Subject: Re: [Hampshire] [OT] MTBF

James Courtier-Dutton wrote:
>
> I think people don't seem to realize that HDs have very low resistance
> to shock while switched on, and this is the main cause of HD failures.
> On the HDs that failed on you, were you able to determine the max G
> force that it received during its powered on life? I don't know how to
> get that information, but I am sure it would provide for some
> interesting stats. I think it would also be useful to have some study
> that would try and look into why HDs fail early in life, and thus try
> and recognize which ones are more likely to fail early. We might
> eventually get to the HAL 9000 prediction that a device is certain to
> fail in 72 hours, but until that time it will function normally.
>
>
Most (all?) modern HDDs have a whole raft of sensors and store lifetime
information about read errors, temperature range and so on. This is SMART
(you might see it mentioned on the BIOS screen). In Linux you can query it
using smartctl:

~]# smartctl --all /dev/sda

For example the stats from my current drive here at work:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   093   006    Pre-fail  Always       -       16203744
  3 Spin_Up_Time            0x0003   098   095   070    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       69
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       139114079
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11202
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       99
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   057   045    Old_age   Always       -       39 (Lifetime Min/Max 21/43)
194 Temperature_Celsius     0x0022   039   043   000    Old_age   Always       -       39 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   064   060   000    Old_age   Always       -       164354431
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

You can see all sorts of interesting things here, which can easily be
used to warn of the pending failure of a drive.
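
As a rough sketch of what such a warning could look like, the Python
below parses attribute rows in the format smartctl prints (one attribute
per line) and flags any attribute whose normalised VALUE has dropped to
or below its THRESH. The function names, the regular expression and the
short list of raw counters worth watching are my own assumptions for
illustration, not anything smartctl itself provides, and the sample rows
are copied from the output above.

```python
import re

# Raw counters that should stay at zero on a healthy drive.
# This list is an assumption made for the sketch, not an official one.
WATCH_RAW = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, then the leading integer of the raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return one dict per attribute row found in the given text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(4)),
                "worst": int(m.group(5)),
                "thresh": int(m.group(6)),
                "type": m.group(7),
                "raw": int(m.group(10)),
            })
    return rows


def find_warnings(rows):
    """Return a list of human-readable warning strings."""
    out = []
    for r in rows:
        # A threshold of 0 means the attribute can never "fail".
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            out.append("%s: value %d at or below threshold %d"
                       % (r["name"], r["value"], r["thresh"]))
        elif r["name"] in WATCH_RAW and r["raw"] > 0:
            out.append("%s: raw count is %d" % (r["name"], r["raw"]))
    return out


SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   108   093   006    Pre-fail  Always       -       16203744
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   039   043   000    Old_age   Always       -       39 (0 19 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    problems = find_warnings(parse_attributes(SAMPLE))
    for p in problems:
        print(p)
    if not problems:
        print("no warnings")
```

To use it on a live drive you would feed it the text from
"smartctl -A /dev/sda" (run as root) instead of SAMPLE. In practice the
smartd daemon that ships with smartmontools already does this kind of
monitoring, so the above is only meant to show the idea.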

Also, most laptop drives now have accelerometers which detect any
dangerous shock conditions and park the drive heads to prevent further
damage to the drive. I cannot find it now, but I watched a video on the
web showing an SSD vs an HDD in a vibration test: the laptop was playing
a video while being vibrated. The SSD obviously didn't miss a beat, while
the HDD paused and stuttered but certainly didn't die; once the
vibration stopped, everything returned to normal.
