Research on schemes for automatically cleaning large numbers of files under Linux

Time: 2020-10-31

Regularly cleaning up expired and garbage files to maintain a reasonable space utilization rate on the file system is part of a system administrator's daily work. For small and medium-sized file systems, simple system commands or scripts are enough; but for large and very large file systems holding hundreds of millions or even billions of files, file cleaning becomes a hard problem. How to determine which files need to be cleaned up, how to delete files in large batches, and how to guarantee cleaning performance are all problems the system administrator has to solve. This paper discusses the commands and methods for automatically cleaning large batches of files under Linux, along with best practices from actual operation.

Requirements for automatic file cleaning
System administrators manage the most valuable asset of an enterprise: its data. Since Linux holds roughly half of the enterprise server operating system market, Linux system administrators have become some of the most important custodians of that asset, and their core task is to keep stored data valuable. In 1991, when IBM launched a 3.5-inch 1 GB hard disk, an administrator could know every file on the disk and manage them by hand. Today, PB-scale storage devices pose unprecedented challenges to file management.
Anyone who has used Linux can delete a file. But can you handle the following deletion tasks?
Delete all files ending with a specific suffix across an entire file system
Delete a specified file in a file system containing one million files
From a file system containing tens of millions of files, delete the 100,000 files created on a specified date
On a file system containing hundreds of millions of files, run a daily cleanup that deletes the millions of files created more than a year ago
The rest of this article discusses strategies and methods for implementing the deletion tasks above; if they already look easy to you, you can skip it.
For file system cleaning, we can divide the work into two categories: cleaning up expired files and cleaning up garbage files.

Expired files
Any data has its own life cycle. The life cycle curve of data tells us that data is most valuable in the period just after it is generated, and that its value then decays over time. At the end of the data life cycle, expired files should be deleted to free storage space for valuable data.
Garbage files
During system operation, all kinds of temporary files are generated: temporary files from running applications, trace files produced by system errors, core dumps, and so on. Once these files have been processed, they lose their retention value and can collectively be called garbage files. Cleaning them up in time helps with system maintenance and management and keeps the system running stably and efficiently.

Overview of automatic file cleaning
Features and methods of automatic file cleaning
'rm' can delete a file at a specified absolute path. If you only know the file name but not the path, you can locate it with 'find' and then delete it. By extension, if we can find the specified files according to preset conditions, we can delete them. This is the basic idea of automatic file cleaning: generate the list of files to be deleted according to preset conditions, then run a scheduled cleaning task to delete them.
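This basic idea can be sketched with ordinary commands. A minimal example, assuming a hypothetical /data tree, a 30-day age condition, and file names without spaces:

# step 1: generate the list of files matching the preset condition
find /data -type f -mtime +30 > /tmp/to_delete.lst
# step 2: a scheduled task consumes the list and deletes the files
xargs -a /tmp/to_delete.lst rm -f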
For expired files, the common marker is a timestamp. Depending on the file system, this may be the file creation time, access time, expiration time, or another time attribute. Since most expired files live in archive systems, their defining characteristic is sheer volume: on a large system, the number of files expiring every day can reach hundreds of thousands or even millions. For so many files, scanning the file system and generating the file list takes a long time, so cleaning performance is a problem the administrator has to consider.
Garbage files may be files stored in a specific directory, files ending with a particular suffix, or zero-length files produced by a system error. Such files are generally fewer in number, but they come in many kinds and the situations are more complex. Based on the system administrator's experience, more detailed file query conditions must be formulated; the files are then scanned regularly, a file list is generated, and further processing is carried out.
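Such query conditions can be expressed directly with 'find'. A sketch, where the directory name and the suffix are illustrative assumptions:

# garbage in a known temporary directory
find /data/tmp -type f > garbage.lst
# files with a particular suffix, plus zero-length files left by errors
find /data -type f \( -name '*.tmp' -o -size 0 \) >> garbage.lst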

Introduction to Linux commands
Common file system management commands include 'ls', 'rm', 'find', and so on. Since these are everyday system management commands, we will not repeat them here; for detailed usage, see the command help or the Linux manual pages. Very large file systems are generally stored on dedicated file systems, which provide their own commands for file system management. The practice section of this paper takes IBM's GPFS file system as an example, so some GPFS file system management commands are briefly introduced below.
mmlsattr
This command is mainly used to view the extended attributes of files in a GPFS file system, such as storage pool information and expiration time.
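A typical invocation (the file path is an illustrative assumption) uses the -L flag to print the long form of a file's attributes, including its storage pool:

mmlsattr -L /data/somefile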
mmapplypolicy
GPFS uses policies to manage files. This command performs various operations on the GPFS file system according to user-defined policy files, with very high efficiency.
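Usefully, a policy can be evaluated without changing anything: the -I option of mmapplypolicy has a 'test' mode that only reports what the rules would match. A sketch, with an assumed policy file name:

mmapplypolicy /data -P policy_rules.txt -I test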

Difficulties in automatically cleaning large numbers of files
Linux file deletion mechanism
Linux controls file deletion through link counts: a file is removed only when no links to it remain. Each file has two counters, i_count and i_nlink. i_count is the number of current users of the file, while i_nlink is the number of on-disk links; in other words, i_count is an in-memory reference counter and i_nlink is a hard disk reference counter. When a file is referenced by a process, i_count increases; when a hard link to the file is created, i_nlink increases.
For 'rm', deletion means decreasing i_nlink. But there is a subtlety: if a file is being used by a process and a user runs 'rm' on it, what happens? After the 'rm', 'ls' and other file management commands can no longer find the file, but the process keeps running normally and can still read its contents correctly. This is because 'rm' only sets i_nlink to 0; since the file is still in use by the process, its i_count is not 0, so the system does not actually delete the file. In this sense, i_nlink reaching zero is the sufficient condition for initiating a file's deletion, while i_count reaching zero is the necessary condition for the file to actually be removed.
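The mechanism is easy to observe directly. A minimal demonstration, using a scratch file whose path is purely illustrative:

echo hello > /tmp/demo.txt
tail -f /tmp/demo.txt &   # a background process holds the file open, so i_count > 0
reader=$!
rm /tmp/demo.txt          # i_nlink drops to 0; 'ls' no longer shows the file
df -i /tmp                # the inode is still in use
kill $reader              # release the last reference, so i_count reaches 0
df -i /tmp                # only now is the inode actually freed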
For deleting a single file we may not care about this mechanism at all, but for large-scale file deletion it becomes a very important factor, as the following sections will show. For now, just keep the Linux file deletion mechanism in mind.

Generating the list of files to be deleted
When there are 10 files in a directory, 'ls' shows them at a glance, and 'ls -alt' can even display their detailed attributes. With 100 files, 'ls' may only be practical for listing names; with 1,000 files, paging through several screens may still be acceptable. With 10,000 files, 'ls' may grind for a long time before returning; at 100,000, many systems stop responding or fail with "Argument list too long". And it is not just 'ls': other common Linux system management commands hit similar limits. The shell has parameters that cap command length, and even if we raise the limit by adjusting those parameters, that does nothing for execution efficiency. On a very large file system, waiting for ordinary file management commands like 'ls' and 'find' to return is simply not acceptable.
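For file systems that are still within reach of ordinary tools, the standard workaround is to stream names into a file rather than expand them on a command line. A sketch, with an assumed /data path:

# never expands the whole file list onto one command line
find /data -type f -name '*.tmp' > list.txt
# feed the names to rm in manageable batches
xargs -a list.txt rm -f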
So how can we generate the deletion list on an even larger file system? A high-performance file system index is the answer, and building a performant file index is the privilege of a select few (which also explains why Google and Baidu make so much money). Fortunately, file systems at this scale generally exist only as high-performance file systems, which provide very powerful file management functions. For example, mmapplypolicy in IBM's General Parallel File System (GPFS) can scan an entire file system very quickly by scanning the inodes directly, and can return a file list matching specified conditions. The practice section below shows how to obtain a file list based on timestamps and file types.

Effect of deadlock on file deletion performance
Consider a system with a daily scheduled deletion task: it first generates the list of files to be deleted, then uses that list as input to the deletion operation. If one day's list is too large and the first day's deletion has not finished when the next day's deletion task starts, what happens?
Files not yet deleted on the first day will appear again in the next day's deletion list, and the next day's deletion process will take them as input and try to delete them. At that point the first day's process and the second day's process are trying to delete the same files. The system throws a large number of unlink failures, which greatly degrades deletion performance. That degradation in turn means files are still left over the next day, the third day's deletion process aggravates the deletion deadlock further, and the system enters a vicious circle of declining performance.
Would simply deleting the first day's to-be-deleted list solve the problem? No. As the Linux file deletion mechanism above explains, deleting the first day's list file only clears its i_nlink. As long as the first day's deletion process has not finished, the list file's i_count is not zero, so the file is not actually removed. Only when the process has worked through every file in the list and exited is the first day's list file really deleted.
At a minimum, we need to terminate any other deletion processes in the system before a new deletion process starts, to guarantee that the deletion deadlock cannot occur. Even then there is a drawback: in the extreme case where the deletion process cannot finish within one cycle for a sustained period, the to-be-deleted list keeps growing and the file scan takes ever longer, which squeezes the working time left for the deletion process and creates another vicious cycle.
Moreover, practical experience tells us that when the deletion list is particularly large, the deletion process itself slows down; an input file of appropriate size keeps the process running effectively. Therefore, splitting the full to-be-deleted list into a series of fixed-size files makes the deletion operation stable and efficient. And if storage and host performance allow, splitting into multiple files also lets us run several deletion processes in parallel.
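A sketch combining the two safeguards just discussed, serializing runs with a lock directory and feeding the deletion in fixed-size chunks (all names and sizes are illustrative assumptions):

#!/bin/bash
# refuse to start while a previous deletion run is still active
lock=/var/run/trash_clear.lock
mkdir "$lock" 2>/dev/null || { echo "previous run still active" >&2; exit 1; }
trap 'rmdir "$lock"' EXIT
# split the full list into 10000-line chunks and delete chunk by chunk
split -a 4 -d -l 10000 trash.lst trash_split_
for part in trash_split_*; do
    xargs -a "$part" rm -f
done
rm -f trash_split_*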

Best practices for automatic cleaning of large numbers of files
Best practice: automatic cleaning of very large numbers of files on a GPFS file system
The following is a file cleaning practice on a GPFS file system holding hundreds of millions of files. The hardware environment is two IBM x3650 servers and a DS4200 disk array with a storage capacity of 50 TB, running Linux and GPFS v3.2. The goal is to perform a cleaning operation at 2:00 a.m. every day, deleting files not accessed for more than 30 days as well as all files ending in .tmp.
An mmapplypolicy scan shows that there are 323784950 files and 158696 directories on the system.


The code is as follows:

………….
[I] Directories scan: 323784950 files, 158696 directories,
0 other objects, 0 ‘skipped’ files and/or errors.
………….

Define the scan rules as follows and save them as trash_rule.txt:


The code is as follows:

RULE EXTERNAL LIST 'trash_list' EXEC ''
RULE 'exp_scan_rule' LIST 'trash_list' FOR FILESET('data')
WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 30
RULE 'tmp_scan_rule' LIST 'trash_list' FOR FILESET('data') WHERE NAME LIKE '%.tmp'

Execute mmapplypolicy, together with the grep and awk commands, to generate the complete list of files to be deleted, then use the split command to divide the complete list into sub-lists of 10000 files each:


The code is as follows:

mmapplypolicy /data -P trash_rule.txt -L 3 | grep "/data" | awk '{print $1}' > trash.lst
split -a 4 -l 10000 -d trash.lst trash_split_

Execute the following command to delete:


The code is as follows:

for a in trash_split_*
do
rm `cat $a`
done
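One caveat on this loop: rm `cat $a` expands every name in a chunk onto a single command line, which works here because each chunk is bounded, but an equivalent form that sidesteps command-line length limits entirely (assuming GNU xargs and names without spaces) would be:

for a in trash_split_*
do
xargs -a "$a" rm -f
done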

Save the above operations as trash_clear.sh, then define the crontab task as follows:


The code is as follows:

0 2 * * *   /path/trash_clear.sh

When the deletion task is executed manually, the scan of files to be deleted reports the following:


The code is as follows:

[I] GPFS Policy Decisions and File Choice Totals:
Chose to migrate 0KB: 0 of 0 candidates;
Chose to premigrate 0KB: 0 candidates;
Already co-managed 0KB: 0 candidates;
Chose to delete 0KB: 0 of 0 candidates;
Chose to list 1543192KB: 1752274 of 1752274 candidates;
0KB of chosen data is illplaced or illreplicated;

During the file deletion process, we can use the following commands to estimate the number of files deleted per minute. From the output below, the deletion rate is 1546 files per minute:


The code is as follows:

df -i /data; sleep 60; df -i /data
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/data 2147483584 322465937 1825017647 16% /data
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/data 2147483584 322467483 1825016101 16% /data

Use the 'time' command to time the whole deletion run. From the output, the file deletion took 1168 minutes in total (about 19.5 hours):


The code is as follows:

time trash_clear.sh

real 1168m0.158s
user 57m0.168s
sys 2m0.056s

Of course, the GPFS file system itself also provides other file cleaning methods. For example, mmapplypolicy can be used to perform the deletion operation directly, which can make the cleaning task even more efficient. Since the purpose of this paper is to discuss a general method for large-scale file cleaning, we do not cover cleaning based on such file-system-specific functions here; interested readers can try it themselves.
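As a pointer for such experiments: the GPFS policy language also supports DELETE rules, so the list-then-rm pipeline above could in principle be collapsed into the policy itself. The following is an untested sketch modeled on the LIST rules used earlier; the rule names are invented and the same 30-day access-time window is assumed:

RULE 'purge_expired' DELETE FOR FILESET('data')
WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 30
RULE 'purge_tmp' DELETE FOR FILESET('data') WHERE NAME LIKE '%.tmp'

Saved as, say, purge_rule.txt, it would be applied with: mmapplypolicy /data -P purge_rule.txt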