Linux: calculate a single hash for a given folder and its contents?

Surely, there must be a way to do this easily!

I tried Linux command-line tools such as sha1sum and md5sum, but they only seem to calculate hashes of individual files and list the hash values, one per file.

I need to generate one hash for the entire contents of a folder (not just file names).

I would like to do something like

 sha1sum /folder/of/stuff > singlehashvalue 

Edit: to clarify, my files are located at several levels in a directory tree; they are not all sitting in the same root folder.

+70
python linux bash hash
Feb 13 '09 at 9:51
15 answers

One possible way:

 sha1sum path/to/folder/* | sha1sum

If you have a whole directory tree, you are probably better off using find and xargs. One possible command would be

 find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

And finally, if you also need to consider permissions and empty directories:

 (find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
  find path/to/folder \( -type f -o -type d \) -print0 | sort -z | xargs -0 stat -c '%n %a') \
 | sha1sum

The arguments to stat cause it to print the name of the file followed by its octal permissions. The two finds run one after the other, doubling the amount of disk I/O: the first finds all file names and checksums their contents, the second finds all file and directory names and prints each name and mode. The list of "file names and checksums", followed by "names and directories with permissions", is then itself checksummed, producing a single, smaller checksum.
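For illustration, this is the piece the stat call contributes (GNU coreutils: %n is the file name, %a the octal permission bits); the example path in the comment is made up:

 # Prints each path followed by its octal mode, e.g. "path/to/folder/script.sh 755"
 stat -c '%n %a' path/to/folder/*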

+90
Feb 13 '09 at 9:59
source share
  • Use a file system intrusion detection tool such as aide.

  • Hash a tarball of the directory (see the reproducibility note after this list):

    tar cvf - /path/to/folder | sha1sum

  • Write something yourself, like vatine's one-liner:

    find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
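A caveat on the tar approach: a plain tar archive embeds file order, owners, and timestamps, so the resulting hash can differ between machines or runs even when the file contents are identical. A minimal sketch that normalizes those fields, assuming GNU tar 1.28 or later (for --sort):

 tar --sort=name --owner=0 --group=0 --numeric-owner \
     --mtime='1970-01-01 00:00:00 UTC' \
     -cf - /path/to/folder | sha1sum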

+20
Feb 13 '09 at 10:04

You can do tar -c /path/to/folder | sha1sum

+9
Feb 13 '09 at 11:04

If you just want to check if something in the folder has changed, I would recommend the following:

 ls -alR --full-time /folder/of/stuff | sha1sum 

It will simply give you a hash of the ls output, which covers the folders, subfolders, their files, and their timestamps, sizes, and permissions. That is almost everything you need to determine whether something has changed.

Note that this command does not generate a hash for each file, which is why it should be faster than using find.
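As a rough illustration of how this could be used for change detection (the file name state.sha1 is just an assumption for the example):

 # Record the current state once:
 ls -alR --full-time /folder/of/stuff | sha1sum > state.sha1

 # Later, recompute and compare; any diff output means something changed:
 ls -alR --full-time /folder/of/stuff | sha1sum | diff - state.sha1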

+6
Dec 08 '16 at 0:09

If you just want to hash the contents of files ignoring file names, you can use

 cat $FILES | md5sum 

Make sure you have the files in the same order when calculating the hash:

 cat $(echo $FILES | sort) | md5sum 

But you cannot have directories in your file list.
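One hedged way to sidestep both issues (no directories, and a stable order) is to let find produce the list of regular files; the NUL-delimited variant below also tolerates unusual file names:

 find /folder/of/stuff -type f -print0 | sort -z | xargs -0 cat | md5sum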

+3
Feb 13 '09 at 9:54

There is a python script for this:

http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/

If you rename files without changing their alphabetical order, the script will not detect it. But if you change the order of the files or the contents of any file, rerunning the script will give you a different hash than before.

+2
Jan 25 '11 at 17:12

Reliable and clean approach

  • First things first: do not gobble up all the available memory! Hash files in chunks rather than feeding in an entire file at once.
  • Different approaches suit different needs/goals (all of the below, or pick whatever applies):
    • Hash only the entry name of all entries in the directory tree
    • Hash the file contents of all entries (leaving out the metadata: inode number, ctime, atime, mtime, size, and so on; you get the idea)
    • For a symbolic link, its content is the referent name; hash it, or choose to skip it
    • Follow or do not follow (the resolved name behind) a symbolic link when hashing an entry's contents
    • If it is a directory, its contents are just directory entries. During a recursive traversal they will eventually be hashed, but should the entry names of that level be hashed to tag the directory itself? This is useful when the hash is needed to quickly identify a change without traversing deeply to hash the contents; an example is a file being renamed while the rest of the content stays the same and all the files are quite large
    • Handle large files well (again, mind the RAM)
    • Handle very deep directory trees (mind the open file descriptors)
    • Handle non-standard file names
    • What to do with files that are sockets, pipes/FIFOs, block devices, character devices? Hash them as well?
    • Do not update the access time of any entry while traversing, because for certain use cases that would be a side effect and counterproductive (counterintuitive?)

This is just what comes to mind off the top of my head; anyone who has spent some time working on this would have caught plenty of other gotchas and corner cases. A rough shell sketch covering a couple of the points above follows below.
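For illustration only, a minimal sketch (assuming GNU find and coreutils) that combines two of the points above, hashing file contents and hashing symlink referent names without following the links:

 (
   find path/to/folder -type f -print0 | sort -z | xargs -0 sha256sum
   find path/to/folder -type l -printf '%p -> %l\n' | sort
 ) | sha256sum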

Here is a tool that is very light on memory, addresses most of these cases, may be a little rough around the edges, but has been quite helpful.

Example usage and output of dtreetrawl:

 Usage:
   dtreetrawl [OPTION...] "/trawl/me" [path2,...]

 Help Options:
   -h, --help                Show help options

 Application Options:
   -t, --terse               Produce a terse output; parsable.
   -j, --json                Output as JSON
   -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
   -l, --max-level=N         Do not traverse tree beyond N level(s)
   --hash                    Enable hashing(default is MD5).
   -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
   -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
   -N, --no-name-hash        Exclude path name while calculating the root checksum
   -F, --no-content-hash     Do not hash the contents of the file
   -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
   -e, --hash-dirent         Include hash of directory entries while calculating root checksum

A snippet of the human-friendly output:

 ...
 ... //clipped
 ...
 /home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

 Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size           : 66850916 bytes
                entries        : 12484
                directories    : 763
                regular files  : 11715
                symlinks       : 6
                block devices  : 0
                char devices   : 0
                sockets        : 0
                FIFOs/pipes    : 0
+2
Jan 07 '18 at 11:39

I would pipe the results for the individual files through sort (to prevent a mere reordering of the files from changing the hash) into md5sum or sha1sum, whichever you choose.
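A minimal sketch of that idea, sorting the per-file hash lines themselves rather than the file list:

 find /folder/of/stuff -type f -exec sha1sum {} + | sort | sha1sum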

+1
Feb 13 '09 at 9:58

Another tool to achieve this:

http://md5deep.sourceforge.net/

As it sounds: like md5sum, but also recursive, along with other features.
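For example, a recursive run whose output is then collapsed into a single hash (a sketch; check your md5deep version's options):

 md5deep -r /folder/of/stuff | sort | md5sum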

+1
Jul 29 '15 at 13:35

I wrote a Groovy script to do this:

 import java.security.MessageDigest

 public static String generateDigest(File file, String digest, int paddedLength){
     MessageDigest md = MessageDigest.getInstance(digest)
     md.reset()
     def files = []
     def directories = []

     if(file.isDirectory()){
         file.eachFileRecurse(){sf ->
             if(sf.isFile()){
                 files.add(sf)
             }
             else{
                 directories.add(file.toURI().relativize(sf.toURI()).toString())
             }
         }
     }
     else if(file.isFile()){
         files.add(file)
     }

     files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
     directories.sort()

     files.each(){f ->
         println file.toURI().relativize(f.toURI()).toString()
         f.withInputStream(){is ->
             byte[] buffer = new byte[8192]
             int read = 0
             while((read = is.read(buffer)) > 0){
                 md.update(buffer, 0, read)
             }
         }
     }

     directories.each(){d ->
         println d
         md.update(d.getBytes())
     }

     byte[] digestBytes = md.digest()
     BigInteger bigInt = new BigInteger(1, digestBytes)
     return bigInt.toString(16).padLeft(paddedLength, '0')
 }

 println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"

You can customize the usage to skip printing each file, change the message digest, take out the directory hashing, etc. I tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/

 gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
 .DS_Store
 configstore/bower-github.yml
 configstore/insight-bower.json
 configstore/update-notifier-bower.json
 filezilla/filezilla.xml
 filezilla/layout.xml
 filezilla/lockfile
 filezilla/queue.sqlite3
 filezilla/recentservers.xml
 filezilla/sitemanager.xml
 gtk-2.0/gtkfilechooser.ini
 a/
 configstore/
 filezilla/
 gtk-2.0/
 lftp/
 menus/
 menus/applications-merged/

 79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758
+1
Mar 28 '16 at 20:53

Try this in two steps:

  • create a hash file for all files in the folder
  • hash this file

Like this:

 # for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum "$FILE" >> hashes; done
 # sha1sum hashes

Or do it all at once:

 # cat `find /folder/of/stuff -type f | sort` | sha1sum 
0
Feb 13 '09 at 9:57

You can use sha1sum to generate a list of hash values and then sha1sum that list again; it depends on what exactly it is you want to accomplish.

0
Feb 13 '09 at 9:57

I had to check a whole directory for file changes.

But excluding timestamps and directory ownership.

The goal is to get an identical sum anywhere, if the files are identical.

Including when hosted on other machines, regardless of anything other than the files, or changes to them.

 md5sum * | md5sum | cut -d' ' -f1 

It generates a list of hashes, one per file, and then condenses those hashes into a single one.

This is much faster than the tar method.

If we want greater confidence in our hashes, we can use sha512sum with the same recipe.

 sha512sum * | sha512sum | cut -d' ' -f1 

The hashes are likewise identical wherever sha512sum is used, and there is no publicly known practical way to subvert them.

0
Jan 28 '18 at 15:17

Here is a simple, short version in Python 3 that works great for small files (like a source tree or something where each file can easily fit in RAM), ignoring empty directories based on ideas from other solutions:

 import os, hashlib

 def hash_for_directory(path, hashfunc=hashlib.sha1):
     filenames = sorted(os.path.join(dp, fn)
                        for dp, _, fns in os.walk(path)
                        for fn in fns)
     index = '\n'.join('{}={}'.format(os.path.relpath(fn, path),
                                      hashfunc(open(fn, 'rb').read()).hexdigest())
                       for fn in filenames)
     return hashfunc(index.encode('utf-8')).hexdigest()



It works like this:

  1. Find all files in a directory recursively and sort them by name
  2. Calculate the hash (default: SHA-1) of each file (reads the entire file into memory)
  3. Create a text index with the lines "filename = hash"
  4. Encode this index into a UTF-8 byte string and hash that

You can pass another hash function as a second parameter if SHA-1 is not your cup of tea.
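For instance, assuming the function above is saved in a file called dirhash.py (a hypothetical name) in the current directory, it could be invoked from the shell with SHA-256 instead:

 # Hypothetical: the function above saved as dirhash.py in the current directory
 python3 -c 'import hashlib, dirhash; print(dirhash.hash_for_directory("/folder/of/stuff", hashlib.sha256))'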

0
Mar 08 '18 at 11:17

If this is a git repo and you want to ignore any files in .gitignore, you can use this:

 git ls-files <your_directory> | xargs sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1 

This works well for me.
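If file names may contain spaces or other unusual characters, a hedged variant using NUL delimiters (git ls-files -z and xargs -0) should be more robust:

 git ls-files -z <your_directory> | xargs -0 sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1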

0
Jul 07 '19 at 0:01


