Speeding up file comparisons (with `cmp`) on Cygwin?

I wrote a bash script in Cygwin that is somewhat like rsync, although different enough that I believe I cannot use rsync for what I need. It iterates over about a thousand pairs of files in corresponding directories, comparing them with cmp.

Unfortunately, this seems to run terribly slowly - it takes about ten (Edit: actually 25!) times as long as it takes to generate one of the file sets with the Python program.

Am I right in thinking this is surprisingly slow? Are there any simple alternatives that will go faster?

(To give you the details of my use case: I auto-generate a bunch of .c files in a temporary directory, and when I regenerate them I would like to copy only the ones that have changed into the actual source directory, leaving the unchanged ones alone (with their old creation times) so that make knows it doesn't need to recompile them. Not all of the generated files are .c files, though, so I need to do binary comparisons rather than text comparisons.)
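Since the files are already generated by a Python program, the compare-and-copy step itself could be done in-process. This is a minimal sketch of that idea, not the asker's actual script; the directory names and function name are illustrative:

```python
import filecmp
import os
import shutil


def sync_changed(tmpdir, srcdir):
    """Copy each generated file into srcdir only if its content differs,
    so unchanged files keep their old timestamps and make skips them."""
    for name in os.listdir(tmpdir):
        new = os.path.join(tmpdir, name)
        old = os.path.join(srcdir, name)
        # shallow=False forces a byte-by-byte comparison rather than
        # trusting os.stat() metadata alone.
        if not (os.path.exists(old) and filecmp.cmp(new, old, shallow=False)):
            shutil.copy2(new, old)
```

Because everything happens in one process, this avoids forking `cmp` a thousand times, which is the cost the question is really about.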

+6
2 answers

Maybe you should use Python to do some, or even all, of the comparison work?

One improvement would be to run cmp only when the file sizes are the same; if they differ, the file has clearly changed. Instead of running cmp, you might consider generating a hash for each file, using MD5, SHA-1, SHA-256, or whatever takes your fancy (using Python modules or extensions, if that's the correct term). If you don't think you'll be dealing with malicious intent, then MD5 is probably sufficient to detect differences.
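The in-process hashing idea might look like this, using Python's standard hashlib module (the function names are illustrative):

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def files_differ(path_a, path_b):
    """Compare two files by digest, all inside one process - no cmp fork."""
    return file_md5(path_a) != file_md5(path_b)
```

Note that hashing reads every byte of both files, so on its own it is no cheaper than a byte comparison; the win comes from staying in one process and from being able to save the digests for later runs.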

Even in a shell script, you could run an external hashing command, giving it the names of all the files in one directory, and then giving it the names of all the files in the other directory. You could then read the two sets of hash values plus file names and decide which files have changed.

Yes, it does sound like it's taking too long. But the trouble is having to run 1000 copies of cmp, plus the other processing. Both the Python suggestion above and the shell-script suggestion share the idea that they avoid running a program 1000 times; they try to minimize the number of programs executed. This reduction in the number of processes run will give you a pretty big bang for your buck, I expect.


If you can keep the hashes from the "current set of files" around and simply generate new hashes for the new set of files and then compare them, you'll be ahead. Clearly, if the file containing the "old hashes" (for the current set of files) is missing, you'll have to regenerate it from the existing files. This fleshes out the information in the comments a bit.
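A way to persist those "old hashes" between runs is a manifest file that is rewritten on every pass; everything changed counts as changed when the manifest is missing, exactly as the answer says. The manifest name and JSON format here are assumptions, not anything from the original:

```python
import hashlib
import json
import os


def hash_tree(dirpath):
    """Map each regular file name in dirpath to the MD5 digest of its content."""
    hashes = {}
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                hashes[name] = hashlib.md5(f.read()).hexdigest()
    return hashes


def changed_files(dirpath, manifest=".hashes.json"):
    """Return names whose digest differs from the saved manifest, then
    rewrite the manifest. A missing manifest means everything is 'changed'."""
    old = {}
    if os.path.exists(manifest):
        with open(manifest) as f:
            old = json.load(f)
    new = hash_tree(dirpath)
    with open(manifest, "w") as f:
        json.dump(new, f)
    return [name for name, digest in new.items() if old.get(name) != digest]
```

With this in place, each regeneration pass only hashes the new file set once, instead of reading both old and new copies of every file.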

Another possibility: you may be able to track changes in the data you use to generate these files, and use that to tell you which files will have changed (or, at least, to limit the set of files that may have changed and therefore must be compared, since your comments indicate that most files are the same each time).

+3

If you can sensibly compare a thousand-odd files within one process, rather than spawning and executing a thousand extra programs, that is likely to be ideal.

Short answer: add --silent to your cmp invocation, if it isn't there already.

You may be able to speed up the Python version by doing some file-size checks before comparing the data.

First, a quick and hackish bash(1) technique, which may be much simpler if you can switch to a single build directory: use the bash -N test:

```shell
$ echo foo > file
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$ cat file
foo
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
older than last read
$ echo blort > file # regenerate the file here
$ if [ -N file ] ; then echo newer than last read ; else echo older than last read ; fi
newer than last read
$
```

Of course, if some subset of the files depends on some other subset of the generated files, this approach will fail completely. (This may be reason enough to avoid the technique; it's up to you.)

Within your Python program, you could also check the file sizes using os.stat() to determine whether you should call your comparison routine at all; if the files are different sizes, you don't care which bytes changed, so you can skip reading both files. (This would be difficult to do in bash(1) - I know of no mechanism to get a file's size in bash(1) without executing another program, which defeats the whole point of this check.)
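The size-first shortcut described above might be sketched like this in Python; the byte-by-byte fallback for equal-size files is an assumption about how the rest of the comparison would be done:

```python
import os


def quick_differ(path_a, path_b, chunk_size=65536):
    """Report whether two files differ, checking cheap stat() metadata first."""
    # Different sizes: the files must differ, no need to read either one.
    if os.stat(path_a).st_size != os.stat(path_b).st_size:
        return True
    # Same size: fall back to a chunked byte-by-byte comparison.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return True
            if not chunk_a:  # both streams exhausted without a mismatch
                return False
```

For the common case in the question, where a regenerated file grows or shrinks, this answers from two stat() calls without opening either file.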

cmp will do the size comparison internally IFF you use the --silent flag, both files are regular files, and both files are positioned at the same place. (The position is set with the --ignore-initial flag.) If you're not using --silent, add it and see what the difference is.

+1

Source: https://habr.com/ru/post/906722/
