[North Asia Data Recovery] Data recovery from hard disk failures and volume mapping errors on an IBM DS series storage server

Time: 2022-05-28

Server data recovery environment:

IBM DS series storage server;
16 FC hard disks, each with a capacity of 600GB.

Fault:

On the front panel of the storage server, the fault lights for hard disks 10 and 13 were on, the volumes mapped to the RedHat host could not be mounted, and the service went offline. The server administrator contacted the North Asia Data Recovery Center for server data recovery.

Server data backup and test process:

1. After arriving on site, the North Asia data recovery engineer connected to the storage through the storage manager to check its current status. The storage reported the logical volume status as failed. Physical disk status: disk 6 reported “warning”, and disks 10 and 13 reported “failure”.

2. The complete current storage logs were backed up through the storage manager. By parsing the backed-up logs, the North Asia data recovery engineer obtained partial information about the logical volume structure.

3. The 16 FC disks were labeled, their original slot numbers recorded, and they were then removed from the storage. A preliminary test with the FC disk imaging equipment “r510+sun3510” showed that all 16 disks could be identified normally. The SMART status of each disk was then checked: disk 6 reported “warning”, consistent with the report in the IBM storage manager.

4. In the Windows environment, the FC disks recognized by the equipment were marked offline in the disk manager to write-protect the original disks. The North Asia data recovery engineer then used software to make sector-level images of the original disks, copying every physical sector to image files on a logical disk. During imaging, disk 6 was found to image very slowly. Combined with the earlier SMART warning, the engineer judged that disk 6 had a large number of damaged and unstable sectors, which ordinary application software cannot handle.

5. A dedicated bad-sector disk imaging device was therefore used to image disk 6. Observing the speed and stability of the imaging showed that disk 6 did not have many outright bad sectors, but it did have a large number of unstable sectors with long read response times. The North Asia data recovery engineer adjusted the copy strategy for disk 6 accordingly, modifying parameters such as the number of sectors to skip on a bad sector and the response timeout, and then continued imaging disk 6 while monitoring the imaging of the remaining disks.

6. After the other images were completed, the logs were checked. Disk 1, which reported no error either in the storage manager or in its SMART status, also had bad sectors, and disks 10 and 13 had a large number of irregularly distributed bad sectors. By locating the affected areas in the target image files and analyzing the bad-sector lists, it was found that some key ext3 file system metadata had been damaged by the bad sectors. The only option was to wait until disk 6 was fully imaged and then repair the damaged file system manually, using XOR within the same stripe (the parity relation is given below) and the surrounding file system context.
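For reference, this kind of stripe-level repair relies on the single-parity relation of the array. Assuming a RAID 5 style single-parity layout (an inference from the XOR repair described above, not something reported directly by the storage manager), the parity block P of each stripe is the XOR of its data blocks:

P = D1 ⊕ D2 ⊕ … ⊕ Dn

so any one damaged block Dk in a stripe can be rebuilt as the XOR of P with the remaining intact data blocks of that stripe.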

7. The imaging of disk 6 completed, but the copy policy had been set to recover as many good sectors as possible while protecting the heads, so it automatically skipped some unstable sectors and the image was still incomplete. The North Asia data recovery engineer adjusted the copy strategy and re-imaged the skipped sectors until every sector of disk 6 had been captured.

8. With physical-sector images of all the hard disks in hand, all image files were loaded into analysis software on the recovery platform. Through reverse analysis of the ext3 file system and of the log files, the North Asia data recovery engineer determined the order of the 16 FC disks in the storage, the RAID block size, and the parity rotation direction and parity algorithm of the RAID. The RAID was then reassembled virtually in software. Once the RAID was rebuilt, the ext3 file system was analyzed further; after consulting with the server administrator, some Oracle DMP files were extracted and the administrator attempted a restore.

9. During the DMP restore, the database reported an IMP-00008 error. Careful analysis of the import log showed that the recovered DMP files were themselves corrupt, which caused the import to fail. The North Asia data recovery engineers immediately re-analyzed the RAID structure to determine more precisely how far the ext3 file system damage extended. After several hours of work, the DMP files and the original DBF data files were recovered again; the recovered DMP files were handed to the server administrator for an import test, and no problems were found. The recovered original DBF files were then verified and checked, and all of them passed.

Server database recovery process:

1. Copy the database files to the original database server under /home/oracle/tmp/syntong as a backup. Create an oradata folder under the root directory, copy the entire backed-up syntong folder into /oradata, and then correct the group ownership and permissions of the oradata folder and all of its files.

2. Back up the original database environment, including the files under the product folder in ORACLE_HOME. Configure the listener and use sqlplus on the original machine to connect to the database. Start the database to the nomount state; after querying the basic status, confirm that the environment and parameter files are fine. Start the database to the mount state; querying the status again shows no problem. Start the database to the open state, and an error is reported:
[Screenshot: error reported when opening the database]
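For reference, the startup sequence described in this step typically looks as follows in sqlplus; the status queries shown are common checks and are given as an illustration, not a record of the exact commands used on site:

SQL> startup nomount
-- confirm the instance started and the parameter file is readable (status STARTED)
SQL> select status from v$instance;
SQL> alter database mount;
-- confirm which control files are in use (status should now be MOUNTED)
SQL> select name from v$controlfile;
-- opening the database is the step that raised the error shown above
SQL> alter database open;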

3. After further analysis, the North Asia data recovery engineer determined that the fault was caused by inconsistency between the control file and the data files, a common fault after a power failure or sudden shutdown.

4. Check the database files one by one. All database files are found to be physically intact.
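A hedged example of the kind of dictionary checks that support this conclusion in the mount state; the exact checks used on site are not recorded, so the queries below are only illustrative:

-- list the data files and their status
SQL> select file#, name, status from v$datafile;
-- list any files that still require media recovery
SQL> select * from v$recover_file;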

5. In the mount state, the North Asia data recovery engineer backs up the control file to trace: alter database backup controlfile to trace as '/backup/controlfile'; The backed-up trace file is then reviewed and edited to obtain the commands for rebuilding the control file, and these commands are copied into a new script file, controlfile.sql.

6. Shut down the database and delete the three control files under /oradata/syntong/. Start the database to the nomount state and execute the controlfile.sql script (a sketch of such a script follows the screenshot below).
[Screenshot: executing the controlfile.sql script]
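For illustration, a control file rebuild script produced this way usually has the following shape. The database name, file lists, sizes and character set below are assumptions based on the paths mentioned earlier, not the actual script used in this case:

-- controlfile.sql (illustrative sketch only; name, paths, sizes and character set are assumed)
CREATE CONTROLFILE REUSE DATABASE "SYNTONG" RESETLOGS NOARCHIVELOG
    MAXLOGFILES 16
    MAXLOGMEMBERS 3
    MAXDATAFILES 100
LOGFILE
  GROUP 1 '/oradata/syntong/redo01.log' SIZE 50M,
  GROUP 2 '/oradata/syntong/redo02.log' SIZE 50M,
  GROUP 3 '/oradata/syntong/redo03.log' SIZE 50M
DATAFILE
  '/oradata/syntong/system01.dbf',
  '/oradata/syntong/undotbs01.dbf',
  '/oradata/syntong/users01.dbf'
CHARACTER SET ZHS16GBK;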

7. After the control file is rebuilt, starting the database directly reports an error and needs further handling.
[Screenshot: error reported after rebuilding the control file]

Then execute the recovery command:
[Screenshot: executing the recovery command]

Perform media recovery until the database reports that media recovery is complete.
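When the control file has been rebuilt with RESETLOGS, this recovery step typically takes the following form; this is a hedged sketch, since the exact command and the logs applied on site are only shown in the screenshots above:

SQL> recover database using backup controlfile until cancel
-- apply the suggested redo/archived logs when prompted, or enter CANCEL when
-- none remain, until "Media recovery complete." is reported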

8. Try to open the database.
SQL> alter database open resetlogs;

9. The database starts successfully. Add the data files of the original temp tablespace back to the corresponding temporary tablespace.
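A hedged example of this step (rebuilding the control file does not carry temp files over, so they have to be re-added); the tablespace name, path and size below are illustrative assumptions:

SQL> alter tablespace temp add tempfile '/oradata/syntong/temp01.dbf' size 2048M reuse;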

10. Perform various routine checks on the database; no errors are found.
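Typical routine checks at this point include the following standard queries; they are given as an illustration, not as a record of the exact checks performed:

-- the instance should be OPEN and all data files ONLINE (or SYSTEM)
SQL> select status from v$instance;
SQL> select file#, status from v$datafile where status not in ('ONLINE','SYSTEM');
-- invalid objects and tablespace status
SQL> select count(*) from dba_objects where status = 'INVALID';
SQL> select tablespace_name, status from dba_tablespaces;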

11. Perform a full database export (exp) backup, which completes without error. Connect the application to the database for application-level data validation.

12. Data verification is complete, the database repair is finished, and the data recovery is successful.