Should I compare bytes with struct?

Question

I am trying to compare the data in two files and get a list of the offsets where they differ. I tried it on some text files and it worked pretty well. However, it does not work on non-text files that still contain some ASCII text (executables and the like), which I will call binary data files.

It reports some bytes as being the same even though, when I look at them in a hex editor, they clearly are not. I tried printing the binary data that it thinks matches, and I get blank lines where the data should be printed. So I think this is the source of the problem.

So, what is the best way to compare bytes of data that can be either binary or contain ascii text? I thought using the struct module was the starting point ...

As you can see below, I compare the bytes with the == operator.

Here is the code:

import os
import math


#file1 = 'file1.txt'
#file2 = 'file2.txt'
file1 = 'file1.exe'
file2 = 'file2.exe'
file1size = os.path.getsize(file1)
file2size = os.path.getsize(file2)
a = file1size - file2size
end = file1size  #if they are both same size
if a > 0:
    #file 2 is smallest
    end = file2size
    big = file1size

elif a < 0:
    #file 1 is smallest
    end = file1size
    big = file2size


f1 = open(file1, 'rb')
f2 = open(file2, 'rb')



readSize = 500
r = readSize
off = 0
data = []
looking = False
d = open('data.txt', 'w')


while off < end:
    f1.seek(off)
    f2.seek(off)
    b1, b2 = f1.read(r), f2.read(r)
    same = b1 == b2
    print ''
    if same:
        print 'Same at: '+str(off)
        print 'readSize: '+str(r)
        print b1
        print b2
        print ''
        #save offsets of the section of "different" bytes
        #data.append([diffOff, diffOff+off-1])  #[begin diff off, end diff off]
        if looking:
            d.write(str(diffOff)+" => "+str(diffOff+off-2)+"\n")
            looking = False
            r = readSize
            off = off + 1
        else:
            off = off + r

    else:
        if r == 1:
            looking = True
            diffOff = off
            off = off + 1 #continue reading 1 at a time, until u find a same reading
        r = 1  #it will shoot back to the last off, since we didn't increment it here



d.close()
f1.close()
f2.close()          

#add the diff ending portion to diff data offs, if 1 file is longer than the other
a = int(math.fabs(a))  #get abs val of diff
if a:
    data.append([big-a, big-1])


print data

python comparison file byte

chazzycheese Aug 14 '10 at 17:50

2 answers

Jungle Hunter · Answer 1 · 2010-08-14T17:57:55+0000

Have you tried the difflib and filecmp modules?

This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.

The filecmp module defines functions to compare files and directories, with various optional time/correctness trade-offs. For comparing files, see also the difflib module.
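As a rough sketch of how these two modules might be applied to binary data: filecmp.cmp with shallow=False answers only "same or not", while difflib.SequenceMatcher can report where two byte strings differ. This assumes Python 3 syntax (the question's code is Python 2). The names files_identical, changed_spans and write_temp, and the four-byte sample payloads, are hypothetical and chosen for illustration. SequenceMatcher can be slow on large inputs, so treat this as a starting point rather than a tuned solution.

```python
import difflib
import filecmp
import os
import tempfile


def files_identical(path_a, path_b):
    """Return True if both files hold exactly the same bytes."""
    # shallow=False makes filecmp compare the contents instead of
    # trusting matching os.stat() signatures (type, size, mtime).
    return filecmp.cmp(path_a, path_b, shallow=False)


def changed_spans(data_a, data_b):
    """Return the non-equal opcodes between two bytes objects.

    Each item is (tag, a_start, a_end, b_start, b_end) with half-open ranges.
    """
    # autojunk=False turns off the popularity heuristic, which could
    # otherwise ignore byte values that occur very often in long inputs.
    matcher = difflib.SequenceMatcher(None, data_a, data_b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


def write_temp(payload):
    """Hypothetical helper: write payload to a temp file, return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    return path


payload_a = b"\xe0\xb2\xaa\n"
payload_b = b"\xe0\xb2\xbb\n"
path_a = write_temp(payload_a)
path_b = write_temp(payload_b)
path_c = write_temp(payload_a)

same_ab = files_identical(path_a, path_b)  # contents differ at offset 2
same_ac = files_identical(path_a, path_c)  # identical contents
spans = changed_spans(payload_a, payload_b)
print(same_ab, same_ac, spans)

for path in (path_a, path_b, path_c):
    os.remove(path)
```

If only a yes/no answer is needed, files_identical alone is enough; changed_spans is the part that yields offsets.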

bstpierre · Answer 2 · 2010-08-15T03:37:13+0000

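One option is to read each file in binary mode into a bytearray, which lets you index and compare individual byte values: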

Example:

$ od -Ax -tx1 /tmp/aa
000000 e0 b2 aa 0a
$ od -Ax -tx1 /tmp/bb
000000 e0 b2 bb 0a

$ cat /tmp/diff.py 
a = bytearray(open('/tmp/aa', 'rb').read())
b = bytearray(open('/tmp/bb', 'rb').read())
print "%02x, %02x" % (a[2], a[3])
print "%02x, %02x" % (b[2], b[3])

$ python /tmp/diff.py 
aa, 0a
bb, 0a
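Building on the bytearray idea, the following is a minimal sketch of how the differing offset ranges the question asks for could be collected. The function name diff_ranges is hypothetical, and the sketch assumes both files fit in memory, as the example above does. It is written for Python 3 but uses nothing that Python 2.6+ lacks.

```python
def diff_ranges(data_a, data_b):
    """Return inclusive (start, end) offset pairs where the inputs differ.

    If one input is longer, its extra tail is reported as different.
    """
    a = bytearray(data_a)
    b = bytearray(data_b)
    common = min(len(a), len(b))
    longest = max(len(a), len(b))

    ranges = []
    start = None  # offset where the current run of differences began
    for offset in range(common):
        if a[offset] != b[offset]:
            if start is None:
                start = offset
        elif start is not None:
            # The run of differing bytes ended just before this offset.
            ranges.append((start, offset - 1))
            start = None

    if longest > common:
        # The tail of the longer input has no counterpart; merge it with
        # a run that is still open, or start a new run at the tail.
        if start is None:
            start = common
        ranges.append((start, longest - 1))
    elif start is not None:
        ranges.append((start, common - 1))
    return ranges


# Same bytes as /tmp/aa and /tmp/bb in the example above.
result = diff_ranges(b"\xe0\xb2\xaa\n", b"\xe0\xb2\xbb\n")
print(result)
```

For real files, the two arguments would be the results of open(path, 'rb').read() for each file. Comparing integer byte values this way avoids any dependence on how the bytes print to a terminal.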
