Removing identical files on UNIX

I am dealing with a large number of files (about 30,000), each roughly 10 MB in size. Some of them (I estimate 2%) are exact duplicates, and I only need to keep one copy of each duplicated pair (or triplet). Could you suggest an effective way to do this? I am working on Unix.

+3
6 answers

I would write a script to compute a hash of each file. Store the hashes in a set; as you iterate over the files, if a file hashes to a value that is already in the set, delete that file. This would be trivial to do in Python, for example.

For 30,000 files, at 64 bytes per hash-table entry, you are only looking at about 2 megabytes of memory.
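The answer suggests Python; the sketch below shows the same idea as a shell script instead, to match the other answers here. It assumes bash 4+ (for associative arrays) and GNU coreutils (sha256sum); DIR is a placeholder for the directory to scan.

#!/usr/bin/env bash
# Sketch: hash every file, remember the first path seen for each hash,
# and delete any later file whose hash is already in the table.
declare -A seen
while IFS= read -r -d '' f; do
    h=$(sha256sum "$f" | awk '{print $1}')
    if [[ -n "${seen[$h]:-}" ]]; then
        echo "duplicate: $f (same as ${seen[$h]})"
        rm -- "$f"
    else
        seen[$h]="$f"
    fi
done < <(find DIR -type f -print0)

Replace rm with echo on a first run (or work on a copy of the data) until you are confident it selects the files you expect.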

+2

Find possible duplicate files:

find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40

Now you can use cmp to verify that the files are really identical.
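For example, with file1 and file2 standing for two files that the pipeline above reported under the same hash, cmp -s compares them byte by byte and succeeds only if they are identical:

if cmp -s file1 file2; then
    rm -- file2    # keep file1, remove the confirmed duplicate
fi

Note that uniq -d prints only one representative line per group of repeated hashes; with GNU uniq, -D (--all-repeated) prints every member of a group, which makes it easier to pair files up for the cmp step.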

+2

This one-liner prints each duplicate next to the first file seen with the same hash:

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]}(!($1 in seen)){seen[$1]=$2}'
+2

Write a script that computes an MD5 hash of each file (with md5sum, for example), compares the hashes, and deletes one of the files whenever two hashes match.

+1

Save all the file names in an array, then walk through the array. At each iteration, compare the file's contents against the other files using the md5sum command. If the MD5 hashes match, delete the file.

For example, if file b is a duplicate of file a, md5sum will produce the same hash for both files.
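A minimal sketch of this pairwise approach in shell (it is quadratic in the number of files, so it only makes sense for small sets; FILES is a placeholder list and md5sum is assumed to be available):

#!/usr/bin/env bash
# Compare every file against every later file by MD5 and delete matches.
FILES=( /path/to/dir/* )
for (( i = 0; i < ${#FILES[@]}; i++ )); do
    [[ -f "${FILES[i]}" ]] || continue
    a=$(md5sum < "${FILES[i]}" | awk '{print $1}')
    for (( j = i + 1; j < ${#FILES[@]}; j++ )); do
        [[ -f "${FILES[j]}" ]] || continue
        b=$(md5sum < "${FILES[j]}" | awk '{print $1}')
        if [[ "$a" == "$b" ]]; then
            rm -- "${FILES[j]}"    # keep the earlier file, drop the duplicate
        fi
    done
done

For 30,000 files of 10 MB each this is far too slow; the hash-table approaches above hash each file only once.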

0

There is a tool for this: fdupes

(Reposting this from an old, deleted answer.)
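Typical usage (option names are taken from the fdupes man page; exact behavior can vary between versions, so check the man page on your system):

# List duplicate files under DIR, recursing into subdirectories.
fdupes -r DIR

# Delete duplicates interactively; adding -N keeps the first file in each
# set and removes the rest without prompting.
fdupes -r -d DIR
fdupes -r -d -N DIR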

0
