Reliability of Storage Media
Failure of a storage medium, whether a hard drive, flash drive, or SSD (not to mention portable mobile devices), is fairly common and depends on both the reliability of the device and its operating conditions.
The reliability of a device depends on the quality of the parts used and on the quality of the assembly of individual modules into a fully functioning device. The entire production process is monitored by the manufacturer's technical control service.
The AFR, MTBF, and MTTF metrics are often used to make a relative assessment of device reliability.
AFR and MTBF/MTTF
AFR (Annualized Failure Rate - failure rate per year)
MTTF (Mean Time To Failure)
MTBF (Mean Time Between Failures)
The higher the MTBF, the more reliable the device.
MTTF is specified in hours. The MTTF can be obtained in a variety of ways: using laboratory test data, using actual field failure data, or using part failure rate prediction models.
MTTF = 1/(sum of failure rates of all parts).
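The formula above can be sketched in a few lines of Python. This is only an illustration: the part names and failure rates below are hypothetical values, not taken from any manufacturer's data, and the model assumes the device fails as soon as any one part fails.

```python
def series_mttf_hours(part_failure_rates_per_hour):
    """MTTF of a device that fails as soon as any single part fails."""
    total_rate = sum(part_failure_rates_per_hour)
    if total_rate <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 / total_rate


# Hypothetical failure rates in failures per hour, for illustration only.
example_rates = {
    "motor": 2e-7,
    "heads": 5e-7,
    "electronics": 3e-7,
}

mttf = series_mttf_hours(example_rates.values())
print(f"MTTF = {mttf:,.0f} hours")
```

With these made-up rates the sum is one failure per million hours, so the MTTF comes out at about 1 million hours.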
Here is what Toshiba says about it:
“Instead of MTTF (Mean Time To Failure), the term MTBF (Mean Time Between Failures) is sometimes used. Mean time between failures refers to the time from one failure to the next after the first failure is fixed. Because storage components are usually beyond repair, MTBF is irrelevant, so MTTF is the correct term in this case.”
(Toshiba white paper, September 2015)
Example.
A typical MTTF of 1 million hours for a storage system component means that, in a population of 1 million drives running in systems, one drive per hour can be expected to fail if the population performs according to its reliability specification. For 1000 drives, this means that one failure can be expected every 1000 hours.
1 million hours is about 114 years. But this does not mean that one drive will last 114 years, because the MTTF specification is only valid for drives that work within the warranty period.
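The arithmetic in this example can be checked with a short sketch. The function names are illustrative, and the year is taken as 365 days (8760 hours).

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def hours_between_failures(mttf_hours, population):
    """Average time between failures across a whole population of drives."""
    return mttf_hours / population


def hours_to_years(hours):
    return hours / HOURS_PER_YEAR


print(hours_between_failures(1_000_000, 1_000_000))  # 1 million drives
print(hours_between_failures(1_000_000, 1_000))      # 1000 drives
print(f"{hours_to_years(1_000_000):.1f} years")
```

The larger the population, the more often a failure is seen somewhere in it, even though the MTTF of each drive is unchanged.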
Seagate's website, for its part, states that the company no longer uses the industry-standard "Mean Time Between Failures" (MTBF) to quantify the average failure rate of hard drives. Seagate is moving to another metric: Annualized Failure Rate (AFR).
For drives operating 24 hours a day, 7 days a week, the expected statistical failure rate per year can be calculated from the MTTF using the following formula, where 8760 is the number of hours in a year:
AFR = 1 - exp(-8760 / MTTF)
The exponential term lowers the result because drives that have already failed must be taken into account in the statistics. However, for a small AFR this reduction is negligible, and the formula can be approximated as follows:
AFR ≈ 8760 / MTTF
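To see how small the difference is, here is a minimal sketch that compares the two calculations. It assumes the usual constant-failure-rate (exponential) model, in which AFR = 1 - exp(-8760 / MTTF) and the approximation is 8760 / MTTF. The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours of continuous operation


def afr_exact(mttf_hours):
    """Probability that a drive fails within one year of 24/7 operation."""
    # expm1 keeps precision when the exponent is close to zero.
    return -math.expm1(-HOURS_PER_YEAR / mttf_hours)


def afr_approx(mttf_hours):
    """Linear approximation, reasonable while the AFR is small."""
    return HOURS_PER_YEAR / mttf_hours


for mttf in (1_000_000, 1_400_000, 2_000_000):
    print(f"MTTF {mttf:>9,} h: exact {afr_exact(mttf):.3%}, "
          f"approx {afr_approx(mttf):.3%}")
```

For an MTTF of 1 million hours the exact value is about 0.872% and the approximation about 0.876%, so the simpler formula is adequate for datasheet-level estimates.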
Example.
An MTTF of 1 million hours corresponds to an AFR of about 0.876% (8760 hours per year divided by 1,000,000 hours), so roughly 9 drives out of 1000 working drives could fail within a year. Datacenters will have to budget for that many drive repairs or replacements. This assumes that the drives are within the warranty period and are used in accordance with the operating conditions and environmental restrictions.
A higher failure rate would mean that the manufacturer's drives do not meet the stated reliability specification.
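For budgeting purposes, the expected number of failures in a fleet can be estimated as follows. This sketch assumes that drives fail independently of one another (a binomial model), which is a simplification and is not stated in the manufacturers' documentation. The function names are illustrative.

```python
import math


def expected_annual_failures(drive_count, afr):
    """Mean number of drives expected to fail within one year."""
    return drive_count * afr


def failures_std_dev(drive_count, afr):
    """Spread around the mean, assuming independent failures (binomial)."""
    return math.sqrt(drive_count * afr * (1.0 - afr))


fleet = 1000
afr = 0.00876  # AFR for an MTTF of 1 million hours
mean = expected_annual_failures(fleet, afr)
spread = failures_std_dev(fleet, afr)
print(f"expected failures per year: {mean:.2f} (std dev {spread:.2f})")
```

The standard deviation is a reminder that the yearly count in a real fleet of 1000 drives can easily be a few drives above or below the mean without contradicting the specification.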
Technical documentation for some hard drives
Consider the documentation for WD Gold drives.
WD101KRYZ - 10TB, WD8002FRYZ - 8TB, WD6002FRYZ - 6TB, WD4002FYYZ - 4TB, WD2005FBYZ - 2TB, WD1005FBYZ - 1TB
Here we see that drives from 1TB to 6TB have the same MTBF = 2,000,000 hours and, accordingly, the same AFR = 0.44%. This means that within 1 year, out of 2,000,000 drives, 8800 can fail. However, these values in the table carry footnote markers 5 and 6, and the footnotes show that for the 4TB and 6TB models MTBF and AFR are calculated under different conditions, with a lower annual data transfer load. Hence, they are less reliable than the 1TB and 2TB models, which is to be expected.
It gets more interesting. Based on the specifications in this table, the WD101KRYZ and WD8002FRYZ hard drives have higher reliability (AFR = 0.35%; 7 platters and 14 heads) than the WD6002FRYZ (AFR = 0.44%; 4 platters and 8 heads). The reason for this is not clear; the documentation does not say.
A little about annual loads.
In the documentation under clarification 5 for 1TB and 2TB disks:
“The product MTBF and AFR specifications are based on a base operating temperature of 40°C and a typical system workload of 219 TB/yr. Workload is defined as the amount of user data transferred to or from the hard drive. The product is designed for workloads up to 550 TB per year.”
Now, let's see what this means in practice. We calculate the design load for a disk in one day: 550TB / 365 days ≈ 1.5TB/day (about 1507GB/day). This is the total amount of data written and read by the hard disk in one day.
With a sustained read or write speed of 100MB/s, you can read or write 360GB per hour, or 8.64TB per 24 hours.
At that sustained intensity the drive far exceeds its design workload, so its reliability will decrease, and it may be necessary to replace it before the warranty period ends.
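These conversions are easy to get wrong by hand, so here is a small sketch. It assumes decimal units (1TB = 1000GB = 1,000,000MB), as drive manufacturers use, and a 365-day year. The function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 3600


def annual_tb_to_gb_per_day(tb_per_year, days_per_year=365):
    """Average daily volume that adds up to the given annual workload."""
    return tb_per_year * 1000 / days_per_year


def sustained_mb_s_to_tb_per_day(mb_per_s):
    """Volume moved in 24 hours at a constant transfer rate."""
    return mb_per_s * SECONDS_PER_DAY / 1_000_000


print(f"{annual_tb_to_gb_per_day(550):.0f} GB/day at 550 TB/year")
print(f"{annual_tb_to_gb_per_day(219):.0f} GB/day at 219 TB/year")
print(f"{sustained_mb_s_to_tb_per_day(100):.2f} TB/day at 100 MB/s")
```

In decimal units, 550TB/year works out to roughly 1507GB/day, while a constant 100MB/s moves 8.64TB/day, several times more.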
Seeding torrents alone at 5MB/s amounts to 432GB/day, or 157.68TB/year; at 10MB/s it is 315.36TB/year. For the WD Gold 1TB WD1005FBYZ and 2TB WD2005FBYZ drives, this load is within the design limit of 550TB per year. But for disks from 4TB to 10TB, the 10MB/s case no longer fits, since for them all calculations are given with a typical load of 219TB per year:
“The product MTBF and AFR specifications are based on a base operating temperature of 40°C and a typical system workload of 219 TB/yr. Workload is defined as the amount of user data being transferred to or from a hard drive."
Therefore, the reliability of such a drive can be expected to be reduced within the warranty period.
In systems with heavy loads, the amount of data transferred can be many times larger, which means a marked decrease in reliability.
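The comparison between a constant transfer rate and a rated annual workload can be sketched like this. The 219TB/year and 550TB/year figures come from the WD Gold footnotes quoted above; the function names are illustrative, and decimal units and a 365-day year are assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tb_per_year(mb_per_s):
    """Annual volume moved at a constant rate, in decimal terabytes."""
    return mb_per_s * SECONDS_PER_YEAR / 1_000_000


def workload_ratio(mb_per_s, rated_tb_per_year):
    """Values above 1.0 mean the rated workload is exceeded."""
    return tb_per_year(mb_per_s) / rated_tb_per_year


for rate in (5, 10):
    print(f"{rate} MB/s = {tb_per_year(rate):.2f} TB/year, "
          f"{workload_ratio(rate, 219):.2f}x of 219 TB/year, "
          f"{workload_ratio(rate, 550):.2f}x of 550 TB/year")
```

A ratio above 1.0 does not say by how much the failure rate rises, only that the drive is operating outside the conditions for which MTBF and AFR were specified.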
And here is the WD Red Plus:
WD140EFGX, WD140EFFX - 14TB; WD120EFBX, WD120EFAX - 12TB; WD101EFBX, WD10EFAX - 10TB; WD80EFBX, WD80EFAX, WD80EFZX, WD80EFZZ - 8TB; WD60EFZX - 6TB; WD40EFZX - 4TB; WD30EFZX - 3TB; WD20EFZX - 2TB; WD10EFRX, WD10JFCX - 1TB
For some reason, the AFR figure is missing from the WD Red Plus specification, but it is very easy to calculate: AFR = 0.87%. This means that the rated reliability is about 2 to 2.5 times lower than that of WD Gold (AFR of 0.44% and 0.35%). In addition, footnote 9 says for these drives:
“The MTBF specifications are based on a sample population and are estimated by statistical measurements and acceleration algorithms under typical operating conditions of 90 TB/year and a disk temperature of 40°C. If these parameters are exceeded, and the drive temperature is up to 65°C, the MTBF will degrade. MTBF does not determine the reliability of an individual drive and is not a guarantee."
Therefore, seeding torrents alone at 5MB/s (432GB/day, 157.68TB/year) already exceeds the typical 90TB/year and will lead to a marked decrease in the calculated reliability within the warranty period for drives of this type.
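Returning to the AFR figure derived for WD Red Plus, the following sketch shows which MTBF it implies and how it compares with the WD Gold values. It uses the small-AFR approximation AFR ≈ 8760 / MTBF, and the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def mtbf_from_afr(afr):
    """Invert the approximation AFR = 8760 / MTBF."""
    return HOURS_PER_YEAR / afr


def afr_ratio(afr_a, afr_b):
    """How many times higher the first failure rate is than the second."""
    return afr_a / afr_b


red_plus_afr = 0.0087  # 0.87%, as derived in the text
print(f"implied MTBF: {mtbf_from_afr(red_plus_afr):,.0f} hours")
print(f"vs WD Gold 0.44%: {afr_ratio(red_plus_afr, 0.0044):.2f}x")
print(f"vs WD Gold 0.35%: {afr_ratio(red_plus_afr, 0.0035):.2f}x")
```

An AFR of 0.87% corresponds to an MTBF of roughly 1 million hours, and the ratio to the WD Gold figures is about 2 to 2.5.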
For WD SSD drives designed to work in NAS systems, exact AFR and MTBF/MTTF numbers are not available.
The documentation says that MTTF estimates are based on internal testing using the Telcordia stress test. The MTTF column says "up to 2M".
WD Blue Hard Drives:
WD80EAZZ - 8TB; WD60EZAZ - 6TB; WD40EZAZ - 4TB; WD30EZAZ - 3TB; WD20EZBX, WD20EZAZ - 2TB; WD10EZRZ, WD10EZEX - 1TB; WD5000AZRZ, WD5000AZLX - 500GB.
There is no information about AFR and MTBF/MTTF in this specification.
WD Blue Mobile Hard Drives:
WD20SPZX - 2TB; WD10SPZX -1TB; WD5000LPZX - 500GB; WD320LPCX, WD320LPVX - 320GB.
There is no data on AFR and MTBF/MTTF.
AFR and MTBF/MTTF figures for SanDisk microSD memory cards are likewise not included in the technical documentation.
Now, let's look at the manufacturer Seagate.
Popular 2.5" drives for mobile devices:
ST2000LM010, ST2000LM007 (2 platters, 4 heads) - 2TB;
ST1500LM012 (2 platters, 4 heads) - 1.5TB;
ST1000LM038, ST1000LM035 (1 platter, 2 heads) - 1TB;
ST500LM033, ST500LM030 (1 platter, 2 heads) - 500GB.
As you can see, the models are different, and structurally differ in the number of working surfaces, but for some reason the reliability characteristics are the same.
Unfortunately, AFR and MTBF/MTTF reliability calculations are not provided for products of this level. But it states that “Average annual workload estimate: <55 TB/year. The product specifications assume that the I/O workload does not exceed the average annual workload limit of 55 TB/year. Workloads exceeding the annual rate may degrade and affect the reliability of the device. The average annual workload limit is given in units of TB per calendar year.”
Is 55TB/year a lot or a little? Spread evenly, 55TB/year is about 150.7GB/day.
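One way to judge the figure is to convert it into an average sustained transfer rate. The sketch below assumes decimal units and a 365-day year; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def average_mb_per_s(tb_per_year):
    """Constant transfer rate that adds up to the given annual workload."""
    return tb_per_year * 1_000_000 / SECONDS_PER_YEAR


def gb_per_day(tb_per_year):
    return tb_per_year * 1000 / 365


print(f"55 TB/year = {gb_per_day(55):.1f} GB/day "
      f"= {average_mb_per_s(55):.2f} MB/s on average")
```

Averaged over the year, 55TB/year is under 2MB/s of continuous traffic, which a drive in an always-on device can exceed easily.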
Let's look at Seagate SAS models:
ST4000NM0023, ST4000NM0043, ST4000NM0063 - 4TB
ST3000NM0023, ST3000NM0043, ST3000NM0063 - 3TB
ST2000NM0023, ST2000NM0043, ST2000NM0063 - 2TB
ST1000NM0023, ST1000NM0043, ST1000NM0063 - 1TB
AFR = 0.63%, MTBF = 1,400,000 hours. This means that after a year of operation under the conditions specified in the technical documentation, 0.63% of the drives, that is, 6 to 7 out of 1000, may fail. It is puzzling that the reliability is the same for all hard drives of this family.
Some Seagate SSD Models:
AFR = 0.58%, but subject to the "Total Terabytes Written (TBW) Over Warranty Period" limit.
Let's take the ST480FP0021 as an example. Warranty period = 5 years, total terabytes written (TBW) during the warranty period = 350TB. So the 0.58% AFR applies at an average write load of about 192GB per day (350TB / 5 years = 70TB/year; 70TB / 365 days ≈ 192GB/day).
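The same calculation can be written as a short sketch. The 480GB capacity used for the drive-writes-per-day figure is an assumption inferred from the model number, not a value given in the text, and the function names are illustrative.

```python
def tbw_to_gb_per_day(tbw_tb, warranty_years, days_per_year=365):
    """Average daily writes that exactly use up the TBW over the warranty."""
    return tbw_tb * 1000 / (warranty_years * days_per_year)


def drive_writes_per_day(gb_per_day, capacity_gb):
    """Daily writes expressed as a fraction of the drive's capacity."""
    return gb_per_day / capacity_gb


daily = tbw_to_gb_per_day(350, 5)
# Assumed capacity: 480 GB, inferred from the model name ST480FP0021.
dwpd = drive_writes_per_day(daily, 480)
print(f"{daily:.0f} GB/day, about {dwpd:.2f} drive writes per day")
```

Expressing the limit as drive writes per day makes it easier to compare SSDs of different capacities.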