Finding duplicate files by content in multiple directories

I have downloaded several files from the Internet related to a specific topic. Now I want to check whether any of the files are duplicates. The problem is that the file names will differ, but the contents may be identical.

Is there a way to implement some code that will iterate over multiple folders and tell which files are duplicated?

+3
4 answers

If you are running a Linux/*nix system, you can use the SHA tools such as sha512sum, now that MD5 can be broken.

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in  seen)){seen[$1]=$2}' 

If you want to work with Python, here is a simple implementation:

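import hashlib
import os

def sha(filename):
    """Return the SHA-512 hex digest of a file, or None if it cannot be read."""
    d = hashlib.sha512()
    try:
        # Read in binary mode, in chunks, so large files do not fill memory.
        with open(filename, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                d.update(chunk)
    except OSError as e:
        print(e)
        return None
    return d.hexdigest()

s = {}  # digest -> first file seen with that digest
path = os.path.join("/home", "path1")
for r, d, f in os.walk(path):
    for files in f:
        filename = os.path.join(r, files)
        digest = sha(filename)
        if digest is None:
            continue  # unreadable file, already reported above
        if digest not in s:
            s[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, s[digest]))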

See also: sha512sum, diff (Unix), filecmp (Python)
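
A matching digest only makes two files candidates, so a byte-for-byte check can confirm them. Here is a minimal sketch using the standard-library filecmp module mentioned above. The helper name confirm_duplicate and the demo file names are hypothetical, chosen only for illustration:

```python
import filecmp
import os
import tempfile


def confirm_duplicate(path_a, path_b):
    """Return True only if both files contain identical bytes."""
    # shallow=False makes filecmp compare the contents instead of
    # trusting the os.stat() signatures (type, size, mtime).
    return filecmp.cmp(path_a, path_b, shallow=False)


# Small self-contained demo with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.bin")
    second = os.path.join(tmp, "second.bin")
    third = os.path.join(tmp, "third.bin")
    for name, data in ((first, b"same"), (second, b"same"), (third, b"diff")):
        with open(name, "wb") as fh:
            fh.write(data)
    print(confirm_duplicate(first, second))  # True
    print(confirm_duplicate(first, third))   # False
```

Running this check only on pairs that already share a digest keeps the number of full comparisons small.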

+5

MD5 , MD5, . ?

In Perl:

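use strict;
use warnings;
use File::Find;
use Digest::MD5 qw(md5);

my @directories_to_search = ('a','e');
my %hash;    # digest => first file seen with that content

find(\&wanted, @directories_to_search);

sub wanted  {
        # File::Find has already changed into $File::Find::dir,
        # so $_ can be opened directly without another chdir.
        return unless -f $_;
        open my $fh, '<', $_ or die "Cannot open $File::Find::name: $!";
        binmode $fh;                         # hash raw bytes, not text
        my $con = do { local $/; <$fh> };    # slurp the whole file
        close $fh;
        my $digest = md5($con);              # compute the digest once
        if ($hash{$digest}) {
                print "Dup found: $File::Find::name and $hash{$digest}\n";
        } else {
                $hash{$digest} = $File::Find::name;
        }
}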
+4

, , , MD5 - SHA1, , .

Regex .

, . (, , - !)

+2

MD5 - , , ! ( , ),

PS: , , "\n" linux

EDIT:

On MD5 (from the Wikipedia article on MD5):

However, now that it is easy to generate MD5 collisions, it is possible for the person who created the file to create a second file with the same checksum, so this technique cannot protect against some forms of malicious tampering. In addition, in some cases the checksum cannot be trusted (for example, if it was obtained over the same channel as the downloaded file), in which case MD5 can only provide error checking: it will recognize a corrupt or incomplete download, which becomes more likely when downloading large files.
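
Since a digest match is only as trustworthy as the hash function behind it, one option is to make the algorithm a parameter. The sketch below assumes SHA-256 as the default (my choice, not something stated in the answers) and also groups files by size first, so that only files with a matching size get hashed. The function names file_digest and find_duplicates are hypothetical:

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files need little memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(directories, algorithm="sha256"):
    """Return sorted lists of paths whose size and digest both match."""
    by_size = defaultdict(list)
    for top in directories:
        for root, _dirs, names in os.walk(top):
            for name in names:
                path = os.path.join(root, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path, algorithm)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

It could be called as find_duplicates(["/home/path1", "/home/path2"]), where both directory names are placeholders. Passing algorithm="md5" would still catch accidental duplicates, subject to the collision caveat quoted above.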

+2

Source: https://habr.com/ru/post/1735859/

