What is the best and fastest way to delete a large directory containing thousands of files (in Ubuntu)?

As far as I know, commands like

find <dir> -type f -exec rm {} \; 

are not the best option for deleting a large number of files (counting everything, including files in subfolders). They work well with a small number of files, but with 10+ million files in subfolders they can hang the server.

Does anyone know any specific Linux commands to solve this problem?

+6
9 answers

Here's an example bash script:

 #!/bin/bash
 LOCKFILE=/tmp/rmHugeNumberOfFiles.lock

 # this process gets ultra-low priority
 ionice -c2 -n7 -p $$ > /dev/null
 if [ $? -ne 0 ]; then
     echo "Could not set disk IO priority. Exiting..."
     exit
 fi
 renice +19 -p $$ > /dev/null
 if [ $? -ne 0 ]; then
     echo "Could not renice process. Exiting..."
     exit
 fi

 # check if there is an instance running already. If so -- exit
 if [ -e ${LOCKFILE} ] && kill -0 `cat ${LOCKFILE}`; then
     echo "An instance of this script is already running."
     exit
 fi

 # make sure the lockfile is removed when we exit. Then: claim the lock
 trap "command rm -f -- $LOCKFILE; exit" INT TERM EXIT
 echo $$ > $LOCKFILE

 # also create a tempfile, and make sure it is removed too upon exit
 # (this trap replaces the previous one, so it cleans up both files)
 tmp=$(tempfile) || exit
 trap "command rm -f -- '$tmp' $LOCKFILE; exit" INT TERM EXIT

 # ----------------------------------------
 # option 1: find your specific files, then remove them in batches
 # ----------------------------------------
 find "$1" -type f [INSERT SPECIFIC SEARCH PATTERN HERE] > "$tmp"
 xargs rm -f < "$tmp"    # assumes filenames without spaces or newlines

 # ----------------------------------------
 # option 2: remove the whole directory tree
 # ----------------------------------------
 command rm -r "$1"

 # remove the lockfile, tempfile
 command rm -f -- "$tmp" $LOCKFILE

This script starts by setting its own process priority and disk I/O priority to very low values, to ensure that other running processes are not affected.

It then makes sure that it is the ONLY such process running.

The core of the script really depends on your preference. You can use rm -r if you are sure the entire directory can be discarded indiscriminately (option 2), or you can use find to delete files more selectively (option 1, perhaps using the command-line arguments "$2" and onward to pass in the search pattern, for convenience).

In the implementation above, option 1 (find) first writes everything to a temp file, so that rm is invoked only once per xargs batch rather than once for each file that find locates. When the number of files is really huge, this can be a significant time saver. On the other hand, the size of the temporary file can become an issue, but that is only likely if you are deleting literally billions of files. Also, because the disk I/O runs at such a low priority, using a temp file followed by a single rm pass may well end up slower overall than the find (...) -exec rm {} \; variant. As always, you should experiment a bit to find out what best suits your needs.

EDIT: as suggested in another answer, you can also skip the tempfile entirely and use find (...) -print0 | xargs -0 rm . This uses more memory, since the full paths of all matching files are held in RAM until find has completely finished. On the other hand, there is no additional file I/O from writing to the tempfile. Which one to choose depends on your use case.
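
For reference, a minimal sketch of that tempfile-free variant, reusing the same low-priority setup from the script above ("$1" is again the directory to clean):

 #!/bin/bash
 # run at very low CPU and disk I/O priority, then stream NUL-delimited
 # paths from find straight into batched rm calls via xargs
 ionice -c2 -n7 -p $$ > /dev/null
 renice +19 -p $$ > /dev/null
 find "$1" -type f -print0 | xargs -0 rm -f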

+5

This may seem strange, but:

 $ rm -rf <dir> 
+7

The -r (recursive) switch also deletes everything below the directory, including subdirectories. (Your command deletes only files, not directories.)
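
If you do stick with the file-only find approach, here is a follow-up sketch (not part of the original answer) that removes the directories left empty afterwards:

 find <dir> -depth -type d -empty -delete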

You can also speed up the find approach:

 find -type f -print0 | xargs -0 rm 
+1

I tried each of these commands, but the problem was that the deletion process was locking up the disk; since no other processes could access it, a big pile of processes ended up waiting for the disk, which made the problem worse. Run iotop and find out how much disk I/O your deletion process is using.
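
For example, to show only the processes that are actually doing I/O (iotop usually needs root):

 sudo iotop -o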

Here is a Python script that solved my problem. It deletes 500 files at a time, then takes a 2-second break so that other processes can do their work, then continues.

 import os, os.path
 import time

 for root, dirs, files in os.walk('/dir/to/delete/files'):
     i = 0
     file_num = 0
     for f in files:
         fullpath = os.path.join(root, f)
         i = i + 1
         file_num = file_num + 1
         os.remove(fullpath)
         if i % 500 == 1:
             time.sleep(2)
             print("Deleted %i files" % file_num)
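
A rough shell equivalent of the same batch-and-pause idea, reusing the path and batch size from the script above:

 # delete in batches of 500 files, pausing 2 seconds between batches
 find /dir/to/delete/files -type f -print0 |
     xargs -0 -n 500 sh -c 'rm -f "$@"; sleep 2' _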

Hope this helps some people.

0

If you have to deal with space limitations on a very large file tree (in my case, many Perforce branches), and the search-and-delete process itself sometimes hangs while running -

here is a script that I schedule daily to find all directories containing a specific file ("ChangesLog.txt"), sort the ones older than 2 days, and delete the first matched directory (a new match may appear on each run):

 bash -c "echo @echo Creating Cleanup_Branch.cmd on %COMPUTERNAME% - %~dp0 > Cleanup_Branch.cmd"
 bash -c "echo -n 'bash -c \"find ' >> Cleanup_Branch.cmd"
 rm -f dirToDelete.txt
 rem cd. > dirToDelete.txt
 bash -c "find .. -maxdepth 9 -regex ".+ChangesLog.txt" -exec echo {} >> dirToDelete.txt \; & pid=$!; sleep 100; kill $pid "
 sed -e 's/\(.*\)\/.*/\1/' -e 's/^./"&/;s/.$/&" /' dirToDelete.txt | tr '\n' ' ' >> Cleanup_Branch.cmd
 bash -c "echo -n '-maxdepth 0 -type d -mtime +2 | xargs -r ls -trd | head -n1 | xargs -t rm -Rf' >> Cleanup_Branch.cmd"
 bash -c 'echo -n \" >> Cleanup_Branch.cmd'
 call Cleanup_Branch.cmd

Pay attention to the requirements:

  • Deleting only directories that contain "ChangesLog.txt", since other old directories should not be removed.
  • Calling the Cygwin commands explicitly (via bash -c), because otherwise the Windows commands of the same name are used by default.
  • Collecting the directories to delete into an external text file, to preserve the search results, because the search process sometimes freezes.
  • Limiting the search time by running find as a background process that is killed after 100 seconds.
  • Sorting so that the oldest directories get deletion priority.
0

If you have a fairly modern version of find (4.2.3 or higher), you can use the -delete flag.

 find <dir> -type f -delete 

If you have version 4.2.12 or later, you can get xargs-style command-line batching by terminating -exec with + instead of \; . That way you do not start a separate copy of /bin/rm for every single file.

 find <dir> -type f -exec rm {} \+ 
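
For what it's worth (not stated in this answer): GNU find's -delete also removes directories once they are empty, since it implies depth-first traversal, so the whole tree can be removed in one pass:

 find <dir> -delete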
0

The previous commands are good.

rm -rf directory/ also works quickly even for a billion files in one folder. I have tried it.

0

You can create an empty directory and rsync it into the directory that you need to delete. You will avoid timeout and memory problems.
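
A sketch of that trick (the directory names are just examples):

 # sync an empty directory over the target; --delete removes everything
 # in the target that is not present in the (empty) source
 mkdir /tmp/empty
 rsync -a --delete /tmp/empty/ /dir/to/delete/
 rmdir /tmp/empty /dir/to/delete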

0

If you want to delete many files as quickly as possible, try this:

find . -type f -print0 | xargs -P 0 -0 rm -f

Note that the -P 0 option makes xargs run as many parallel rm processes as possible.
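
If you would rather cap the batch size per rm call and the number of parallel processes, here is a variant (the values are arbitrary examples):

 find . -type f -print0 | xargs -0 -n 1000 -P 4 rm -f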

0

Source: https://habr.com/ru/post/919725/
