Checking SSD Lifetime and Health under CentOS

Time:2019-9-11

It is hard on those of us on a budget who can only afford Crucial and OCZ drives. Is it really impossible to see the lifetime of other vendors' SSDs through a RAID card?

After some research: whenever an SSD sits behind a RAID card, every command for inspecting it needs both MegaCli and smartctl to get the disk's usage.

The RAID cards here are LSI Logic / Symbios Logic MegaRAID SAS 1078 and 2108. They are queried with the usual MegaCli.

MegaCli download addresses:

MegaCli for CentOS 5

MegaCli for CentOS 6

The whole process has two steps. First, get the information about the hard disks through the RAID card. Then use smartctl to display the details of each disk.

Use MegaCli to get information about the hard disks behind the RAID card

Run the following command:

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL

This lists the disks behind the RAID card. The output looks like this:

Enclosure Device ID: 252

Slot Number: 7

Device Id: 28

Sequence Number: 2

Media Error Count: 0

Other Error Count: 1

Predictive Failure Count: 0

Last Predictive Failure Event Seq Number: 0

PD Type: SATA

Raw Size: 119.242 GB [0xee7c2b0 Sectors]

Non Coerced Size: 118.742 GB [0xed7c2b0 Sectors]

Coerced Size: 118.277 GB [0xec8e000 Sectors]

Firmware state: Online, Spun Up

SAS Address(0): 0x1e394d57aa996b80

Connected Port Number: 7(path0)

Inquiry Data: 0000000011070303A99EC300-CTFDDAC128MAG                      0007   

FDE Capable: Not Capable

FDE Enable: Disable

Secured: Unsecured

Locked: Unlocked

Needs EKM Attention: No

Foreign State: None

Device Speed: 6.0Gb/s

Link Speed: 1.5Gb/s

Media Type: Solid State Device

A few of the fields above matter. Media Type: Solid State Device means the disk is an SSD. Note Device Id: 28; it is needed later when querying with smartctl. The disk model appears in Inquiry Data: 0000000011070303A99EC300-CTFDDAC128MAG. Firmware state: Online, Spun Up tells you whether the SSD is working normally, so for SSD monitoring and alerting, watching this one field is basically enough.
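To make the monitoring idea concrete, here is a minimal sketch of how the -PDList output could be parsed. The function names parse_pdlist and ssd_alerts and the shortened sample text are assumptions for illustration; only the field names come from the output shown above.

```python
def parse_pdlist(text):
    """Split MegaCli -PDList output into one dict per physical disk.

    Assumes each disk record starts with an "Enclosure Device ID" line,
    as in the output shown above.
    """
    disks = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and lines without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            disks.append({})  # a new disk record begins
        if disks:
            disks[-1][key] = value
    return disks


def ssd_alerts(disks):
    """Return (device_id, firmware_state) for SSDs that are not online."""
    alerts = []
    for d in disks:
        if d.get("Media Type") != "Solid State Device":
            continue
        state = d.get("Firmware state", "")
        if not state.startswith("Online"):
            alerts.append((d.get("Device Id"), state))
    return alerts


# Shortened sample record (hypothetical excerpt of the output above).
sample = """\
Enclosure Device ID: 252
Slot Number: 7
Device Id: 28
Firmware state: Online, Spun Up
Media Type: Solid State Device
"""

disks = parse_pdlist(sample)
print(disks[0]["Device Id"], ssd_alerts(disks))
```

In real use the text would come from running the MegaCli64 -PDList -aALL command shown earlier. An empty alert list means every SSD reports an Online state.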

Use smartctl to get detailed information about the SSD

Note that the information reported differs between manufacturers and between disk models. Intel drives are not covered here. The command used for the query is given below: -a displays all information, and -d sets the device type. Different RAID cards may use different interfaces, so the exact -d value may differ slightly.

For example, with an Intel drive, -d megaraid,27 works as is (the number after megaraid, is the disk's Device Id as reported by MegaCli). With the RAID cards above, however, I have to add the sat type, so the command becomes:

smartctl -a -d sat+megaraid,27 /dev/sdb1 -s on

sat stands for SCSI to ATA Translation, that is, an ATA device reached through a SCSI layer. The -d option also accepts types such as scsi and ata.

The following information is then displayed:
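As a small sketch of how the -d value is put together, the helper below builds the smartctl argument list from a Device Id. The name smartctl_argv and its use_sat switch are assumptions for illustration, and the trailing -s on from the command above is left out for brevity.

```python
import shlex


def smartctl_argv(device_id, block_device, use_sat=True):
    """Build the smartctl argument list for a disk behind a MegaRAID card.

    device_id is the "Device Id" reported by MegaCli -PDList.
    use_sat prepends "sat+" for cards that need SCSI to ATA translation.
    """
    dev_type = f"megaraid,{device_id}"
    if use_sat:
        dev_type = "sat+" + dev_type
    return ["smartctl", "-a", "-d", dev_type, block_device]


print(shlex.join(smartctl_argv(27, "/dev/sdb1")))
```

The list can be passed to subprocess.run as is. Setting use_sat=False gives the plain megaraid,N form that the text says is enough for Intel drives.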

Model Family:     Crucial/Micron RealSSD C300/C400

Device Model:     C300-CTFDDAC128MAG

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       –       0

5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       –       0

9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       –       5572

12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       –       3

170 Grown_Failing_Block_Ct  0x0033   100   100   000    Pre-fail  Always       –       0

171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       –       0

172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       –       0

173 Wear_Levelling_Count    0x0033   090   090   000    Pre-fail  Always       –       536

174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       –       1

181 Non4k_Aligned_Access    0x0022   100   100   000    Old_age   Always       –       0 0 0

183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age   Always       –       0

184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       –       0

187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       –       0

188 Command_Timeout         0x0032   100   100   000    Old_age   Always       –       0

189 Factory_Bad_Block_Ct    0x000e   100   100   000    Old_age   Always       –       250

195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       –       0

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       –       0

197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       –       0

198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      –       0

199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       –       0

202 Perc_Rated_Life_Used    0x0018   090   090   000    Old_age   Offline      –       10

206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       –       0

For an OCZ drive, the output looks like this:

Device Model:     OCZ-AGILITY3

Serial Number:    OCZ-1OX963Q8B5X2V684

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate     0x000f   086   086   050    Pre-fail  Always       –       135388659

5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       –       9

9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       –       265772576277126

12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       –       15

171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       –       9

172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       –       0

174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      –       13

177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      –       1

181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       –       9

182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       –       0

187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       –       0

194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always       –       30 (Lifetime Min/Max 30/30)

195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      –       135388659

196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       –       9

201 Soft_Read_Error_Rate    0x001c   120   120   000    Old_age   Offline      –       135388659

204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      –       135388659

230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       –       100

231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       –       0

233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      –       2531

234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       –       3465

241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       –       3465

242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       –       2030
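Attribute tables like the two above are easier to monitor once they are parsed. The sketch below assumes the ten-column layout shown here. The function name parse_smart_attributes is hypothetical, and the sample uses a plain hyphen in the WHEN_FAILED column. Rows are keyed by ID because names such as Unknown_Attribute repeat in the OCZ output.

```python
def parse_smart_attributes(text):
    """Parse a smartctl attribute table into a dict keyed by attribute ID."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),   # normalized value, e.g. 090 -> 90
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": " ".join(fields[9:]),  # raw value may contain spaces
        }
    return attrs


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
170 Grown_Failing_Block_Ct  0x0033   100   100   000    Pre-fail  Always       -       0
173 Wear_Levelling_Count    0x0033   090   090   000    Pre-fail  Always       -       536
"""

attrs = parse_smart_attributes(sample)
print(attrs[173]["value"], attrs[173]["raw"])
```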

Judging SSD health from the parameters

Note that the lifetime attribute is not always called Media_Wearout_Indicator as on Intel SSDs (OCZ uses that name too, while on Crucial it becomes Perc_Rated_Life_Used). In practice, though, whether an SSD is healthy is judged mainly from two parameters: Wear_Levelling_Count and Grown_Failing_Block_Ct.

Note the following two lines:

170 Grown_Failing_Block_Ct  0x0033   100   100   000    Pre-fail  Always       –       0

173 Wear_Levelling_Count    0x0033   090   090   000    Pre-fail  Always       –       536

The above two parameters are the key:

Wear Levelling Count: start with this parameter, since it is the more important one. For context, this disk has been in use for one year. The raw value shows 536 program/erase (P/E) cycles for this 128 GB disk, and the normalized value shows 90% of the lifetime remaining. It follows that the flash in this drive is rated for about 5000 P/E cycles: 536 is roughly 10% of 5000, so the value is about 90.

Grown Failing Block Count: this is the number of bad blocks (similar to bad sectors on an HDD) that have appeared in the flash during use. Here the value is 0, so there are no bad blocks. If you are unlucky and this value changes a lot within a short period of normal use on a newly bought SSD, the disk may have a problem, and you should contact after-sales service early.
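The arithmetic above can be written out as a short sketch. The 5000-cycle rating is the figure inferred in the text, not a value read from the drive, and the function name life_used_percent is an assumption for illustration.

```python
def life_used_percent(pe_cycles, rated_pe_cycles=5000):
    """Percentage of rated P/E cycles consumed (rating assumed to be 5000)."""
    return 100.0 * pe_cycles / rated_pe_cycles


used = life_used_percent(536)  # raw Wear_Levelling_Count from the table
print(f"used: {used:.2f}%  remaining: {100 - used:.2f}%")
```

This gives about 10.7% used, which agrees with the normalized value of 90 and with the raw Perc_Rated_Life_Used of 10 in the Crucial output, allowing for rounding by the drive.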

Common MegaCli parameter combinations:

MegaCli -cfgdsply -aALL | grep "Error" [Normally all 0]

MegaCli -LDGetProp -Cache -LALL -a0 [Write policy]

MegaCli -cfgdsply -aALL | grep "Memory" [Memory size]

MegaCli -LDInfo -LALL -aALL [Check RAID level]

MegaCli -AdpAllInfo -aALL [Check RAID card information]

MegaCli -PDList -aALL [View hard disk information]

MegaCli -AdpBbuCmd -aALL [View battery information]

MegaCli -FwTermLog -Dsply -aALL [View RAID card log]

MegaCli -adpCount [Display number of adapters]

MegaCli -AdpGetTime -aALL [Display adapter time]

MegaCli -AdpAllInfo -aALL [Display all adapter information]

MegaCli -LDInfo -LALL -aALL [Display all logical disk group information]

MegaCli -PDList -aALL

MegaCli -AdpBbuCmd -GetBbuStatus -aALL | grep "Charger Status" [View charge status]

MegaCli -AdpBbuCmd -GetBbuStatus -aALL [Display BBU status information]

MegaCli -AdpBbuCmd -GetBbuCapacityInfo -aALL [Display BBU capacity information]

MegaCli -AdpBbuCmd -GetBbuDesignInfo -aALL [Display BBU design parameters]

MegaCli -AdpBbuCmd -GetBbuProperties -aALL

MegaCli -cfgdsply -aALL [Display RAID card model, RAID settings, and disk information]

Changes in disk status, from pulling a disk out to inserting it again:

Device         | Normal  | Damage                 | Rebuild  | Normal
Virtual Drive  | Optimal | Degraded               | Degraded | Optimal
Physical Drive | Online  | Failed -> Unconfigured | Rebuild  | Online
