Best way to recursively find files with the same name but different content in bash?

I have about 15,000 images, in a directory structure like the one attached, whose names are SKUs. I need to make sure that there are no files with the same SKU that are actually different files.

For example, if I have two or more files named MYSKU.jpg, I need to make sure that none of them differ from each other.

What is the best way to do this in a bash command?

+4
3 answers

I do not want to solve this completely for you, but here are some useful ingredients that you can try to put together:

find /path -type f   # gives you a list of all files in /path

you can iterate over a list like this

for f in $(find /path -type f -name '*.jpg'); do   # note: word splitting breaks on paths with whitespace
  ...
done

In the loop body, you can compute, for the current file, its basename, its full path, and a hash of its contents:

base=$(basename "$f")
full_path=$f
hash=$(md5sum "$f" | awk '{print $1}')   # hash of the file contents, not of the file name

Combine these three values into one line of text per file and append it to a list file, e.g. list.txt, so that you end up with one table covering all files.

Once the list covers all files, you can analyze it with standard command-line tools (sort, uniq, awk, diff, ...); what exactly to do depends on how you want to handle the duplicates.

A technique that often helps in situations like this is to create two sorted versions of the list, one of them made unique on a certain key, and then look at the difference between the two:

sort -k2    list.txt | column -t > list.sorted.txt       
sort -k2 -u list.txt | column -t > list.sorted.uniq.txt

where the sort key (column 2) is, for example, the basename. Then

diff list.sorted.txt list.sorted.uniq.txt

shows you the entries that were dropped by the unique sort, i.e. the additional files whose basename already occurred. Based on the MD5 column you can then decide what to do with each of them: mv, rm, ln and so on.
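
To make those ingredients concrete, here is one possible way to put them together (a sketch, not the only way): write one "basename md5 path" line per image to list.txt, build a second copy that is unique on the (basename, md5) pair, and compare the two. The /path, *.jpg and list.txt names are placeholders, and the SKU file names are assumed to contain no spaces.

# one line per image: basename, md5 of the contents, full path
find /path -type f -name '*.jpg' -print0 |
while IFS= read -r -d '' f; do
    printf '%s %s %s\n' "$(basename "$f")" "$(md5sum "$f" | awk '{print $1}')" "$f"
done > list.txt

sort -k1,1 -k2,2    list.txt > list.sorted.txt
sort -k1,1 -k2,2 -u list.txt > list.sorted.uniq.txt   # one line per (basename, md5)

# lines only in the full list are exact duplicates (same name and same md5):
diff list.sorted.txt list.sorted.uniq.txt

# names still occurring more than once in the unique list have at least two
# different md5 sums, i.e. the same SKU with different content:
awk '{print $1}' list.sorted.uniq.txt | uniq -d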

+3

Here is a bash script that does it: it walks the directory tree, stores the md5 checksum of the first file seen for each name in an associative array, and reports any later file with the same name but a different md5:

#!/bin/bash

# directory to scan
scan_dir=$1

[ ! -d "$1" ] && echo "Usage $0 <scan dir>" && exit 1

# Associative array to save hash table
declare -A HASH_TABLE
# Associative array of full path of items
declare -A FULL_PATH


for item in $( find "$scan_dir" -type f ) ; do   # note: word splitting breaks on paths with whitespace
    file=$(basename "$item")
    md5=$(md5sum "$item" | cut -f1 -d' ')
    if [ -z "${HASH_TABLE[$file]}" ] ; then
        HASH_TABLE[$file]=$md5
        FULL_PATH[$file]=$item
    else
        if [ "${HASH_TABLE[$file]}" != "$md5" ] ; then
            echo "differ $item from ${FULL_PATH[$file]}"
        fi
    fi
done

Usage (assuming you save the script as scan_dir.sh and make it executable):

$ ./scan_dir.sh /path/to/your/directory
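
When two files share a name but their md5 checksums differ, the script prints one line per conflict using the echo in the else branch, for example (hypothetical paths):

differ /path/to/your/directory/new/MYSKU.jpg from /path/to/your/directory/old/MYSKU.jpg
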
+1

With bash 4:

#!/usr/local/bin/bash -vx   # -v -x trace execution for debugging; drop them for normal runs

shopt -s globstar # turn on recursive globbing
shopt -s nullglob # hide globs that don't match anything
shopt -s nocaseglob # match globs regardless of capitalization

images=( **/*.{gif,jpeg,jpg,png} ) # all the image files
declare -A homonyms # associative array of like named files

for i in "${!images[@]}"; do # iterate over indices
    base=${images[i]##*/} # file name without path
    homonyms["$base"]+="$i " # Space delimited list of indices for this basename
done

for base in "${!homonyms[@]}"; do # distinct basenames
    unset dupehashes; declare -A dupehashes # temporary var for hashes
    indices=( ${homonyms["$base"]} ) # omit quotes to allow expansion of space-delimited integers
    (( ${#indices[@]} > 1 )) || continue # ignore unique names
    for i in "${indices[@]}"; do
        dupehashes[$(md5 < "${images[i]}")]+="$i " # md5 is the BSD/macOS command; use md5sum on GNU/Linux
    done

    (( ${#dupehashes[@]} > 1 )) || continue # ignore if same hash
    echo
    printf 'The following files have different hashes:\n'
    for h in "${!dupehashes[@]}"; do
        for i in ${dupehashes[$h]}; do # omit quotes to expand space-delimited integer list
            printf '%s %s\n' "$h" "${images[i]}"
        done
    done
done

I know this looks like a lot, but I think that with 15,000 images you really want to avoid open()ing and checksumming files you don't need to, so this approach first narrows the data set down to files that share a name and only then hashes their contents. As others said earlier, you could make it even faster by checking file sizes before hashing, but I will leave that part incomplete.
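
For completeness, here is a minimal, standalone sketch of that size check; it is not wired into the script above, same_content is a made-up helper name, and stat -f%z is the BSD/macOS form to match the md5 call used above (on GNU/Linux you would use stat -c%s and md5sum):

# different size means different content, no hashing needed;
# only when the sizes match do we fall back to comparing hashes
same_content() {
    local a=$1 b=$2
    local size_a size_b
    size_a=$(stat -f%z "$a") || return 2
    size_b=$(stat -f%z "$b") || return 2
    if [ "$size_a" != "$size_b" ]; then
        return 1                              # sizes differ => contents differ
    fi
    [ "$(md5 < "$a")" = "$(md5 < "$b")" ]     # sizes match => compare md5 sums
}

# usage: same_content a/MYSKU.jpg b/MYSKU.jpg || echo "the two MYSKU.jpg files differ"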

0

Source: https://habr.com/ru/post/1544409/

