Removing identical files on UNIX

I am dealing with a large number of files (about 30,000), each roughly 10 MB in size. Some of them (I estimate 2%) are exact duplicates, and I only need to keep one copy of each duplicated pair (or triplet). Could you suggest an effective way to do this? I am working on Unix.

+3
6 answers

I would write a script to compute a hash of each file. Store the hashes in a set; as you iterate over the files, if a file hashes to a value that is already in the set, delete that file. This would be trivial to do in Python, for example.

For 30,000 files, at 64 bytes per hash-table entry, you are only looking at about 2 megabytes of memory.
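The answer suggests Python; the sketch below shows the same idea as a shell script instead, to match the other answers here. It assumes bash 4+ (for associative arrays) and GNU coreutils (sha256sum); DIR is a placeholder for the directory to scan.

#!/usr/bin/env bash
# Sketch: hash every file, remember the first path seen for each hash,
# and delete any later file whose hash is already in the table.
declare -A seen
while IFS= read -r -d '' f; do
    h=$(sha256sum "$f" | awk '{print $1}')
    if [[ -n "${seen[$h]:-}" ]]; then
        echo "duplicate: $f (same as ${seen[$h]})"
        rm -- "$f"
    else
        seen[$h]="$f"
    fi
done < <(find DIR -type f -print0)

Replace rm with echo on a first run (or work on a copy of the data) until you are confident it selects the files you expect.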

+2

Find possible duplicate files:

find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40

Now you can use cmp to verify that the files are really identical.
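For example, with file1 and file2 standing for two files that the pipeline above reported under the same hash, cmp -s compares them byte by byte and succeeds only if they are identical:

if cmp -s file1 file2; then
    rm -- file2    # keep file1, remove the confirmed duplicate
fi

Note that uniq -d prints only one representative line per group of repeated hashes; with GNU uniq, -D (--all-repeated) prints every member of a group, which makes it easier to pair files up for the cmp step.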

+2

This one-liner prints each duplicate next to the first file seen with the same hash:

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]}(!($1 in seen)){seen[$1]=$2}'
+2

Write a script that computes an MD5 hash of each file (with md5sum, for example), compares the hashes, and deletes one of the files whenever two hashes match.

+1

Save all the file names in an array, then walk through the array. At each iteration, compare the file's contents against the other files using the md5sum command. If the MD5 hashes match, delete the file.

For example, if file b is a duplicate of file a, md5sum will produce the same hash for both files.
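A minimal sketch of this pairwise approach in shell (it is quadratic in the number of files, so it only makes sense for small sets; FILES is a placeholder list and md5sum is assumed to be available):

#!/usr/bin/env bash
# Compare every file against every later file by MD5 and delete matches.
FILES=( /path/to/dir/* )
for (( i = 0; i < ${#FILES[@]}; i++ )); do
    [[ -f "${FILES[i]}" ]] || continue
    a=$(md5sum < "${FILES[i]}" | awk '{print $1}')
    for (( j = i + 1; j < ${#FILES[@]}; j++ )); do
        [[ -f "${FILES[j]}" ]] || continue
        b=$(md5sum < "${FILES[j]}" | awk '{print $1}')
        if [[ "$a" == "$b" ]]; then
            rm -- "${FILES[j]}"    # keep the earlier file, drop the duplicate
        fi
    done
done

For 30,000 files of 10 MB each this is far too slow; the hash-table approaches above hash each file only once.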

0

There is a tool for this: fdupes

(Reposting this from an old, deleted answer.)
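Typical usage (option names are taken from the fdupes man page; exact behavior can vary between versions, so check the man page on your system):

# List duplicate files under DIR, recursing into subdirectories.
fdupes -r DIR

# Delete duplicates interactively; adding -N keeps the first file in each
# set and removes the rest without prompting.
fdupes -r -d DIR
fdupes -r -d -N DIR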

0
