I am trying to compare the data in two files and get a list of the offsets where they differ. I tried it on some text files and it worked quite well. However, I run into trouble with non-text files (which still contain ASCII text); I'll call them binary data files (executables, etc.).
My script seems to think that some bytes are the same, although when I look at them in a hex editor they clearly are not. I tried printing the binary data that it thinks matches, and I get blank lines where the data should be printed. So I think this is the source of the problem.
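To illustrate the blank-line symptom: printing raw bytes that contain non-printable values (NUL, control characters, newlines) sends those values straight to the terminal, which can look like an empty line. A minimal sketch of a safer way to inspect a chunk is to print its repr() or a hex dump instead. The helper name show and the sample bytes are made up for illustration.

```python
import binascii


def show(chunk):
    """Return two printable views of raw bytes: the repr and a hex string.

    `show` is a hypothetical helper, not part of the script in this post.
    """
    return repr(chunk), binascii.hexlify(chunk).decode('ascii')


# Sample data with non-printable bytes (NUL, 0x01, newline) around "AB".
sample = b'\x00\x01AB\n'
text_view, hex_view = show(sample)
print(text_view)  # escaped form, so every byte is visible
print(hex_view)   # 000141420a
```

Two chunks that look identical (or blank) when printed raw can then be told apart by their hex views.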
So, what is the best way to compare bytes of data that may be either binary or ASCII text? I thought the struct module might be a starting point...
As you can see below, I compare the bytes with the == operator.
Here is the code:
    import os
    import math

    file1 = 'file1.exe'
    file2 = 'file2.exe'
    file1size = os.path.getsize(file1)
    file2size = os.path.getsize(file2)
    a = file1size - file2size
    end = file1size
    if a > 0:
        end = file2size
        big = file1size
    elif a < 0:
        end = file1size
        big = file2size

    f1 = open(file1, 'rb')
    f2 = open(file2, 'rb')

    readSize = 500
    r = readSize
    off = 0
    data = []
    looking = False
    d = open('data.txt', 'w')

    while off < end:
        f1.seek(off)
        f2.seek(off)
        b1, b2 = f1.read(r), f2.read(r)
        same = b1 == b2
        print ''
        if same:
            print 'Same at: '+str(off)
            print 'readSize: '+str(r)
            print b1
            print b2
            print ''
            if looking:
                d.write(str(diffOff)+" => "+str(diffOff+off-2)+"\n")
                looking = False
                r = readSize
                off = off + 1
            else:
                off = off + r
        else:
            if r == 1:
                looking = True
                diffOff = off
                off = off + 1
            r = 1

    d.close()
    f1.close()
    f2.close()

    a = int(math.fabs(a))
    if a:
        data.append([big-a, big-1])
    print data
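For comparison, here is a minimal sketch of the same idea (not a drop-in fix for the script above): read both files in chunks, compare whole chunks with ==, and walk byte by byte only inside chunks that differ. It remembers the offset where a run of differences starts rather than the last differing offset, and it reports inclusive (start, end) ranges. The names diff_ranges and chunk_size are hypothetical, and counting the extra tail of the longer file as one difference is an assumption about the desired output.

```python
import os


def diff_ranges(path1, path2, chunk_size=4096):
    """Return inclusive (start, end) offset ranges where two files differ.

    Hypothetical helper: the tail of the longer file is reported as a
    difference and merged with a run that reaches the end of the overlap.
    """
    size1, size2 = os.path.getsize(path1), os.path.getsize(path2)
    overlap = min(size1, size2)   # number of bytes present in both files
    ranges = []
    start = None                  # first offset of the current differing run
    offset = 0                    # offset of the first byte of this chunk
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while offset < overlap:
            n = min(chunk_size, overlap - offset)
            b1, b2 = f1.read(n), f2.read(n)
            if b1 == b2:
                # Whole chunk matches: close a run left open by the last chunk.
                if start is not None:
                    ranges.append((start, offset - 1))
                    start = None
            else:
                # Chunk differs somewhere: locate the exact bytes.
                for i, (x, y) in enumerate(zip(b1, b2)):
                    if x != y:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        ranges.append((start, offset + i - 1))
                        start = None
            offset += n
    longest = max(size1, size2)
    if longest > overlap:
        # Extra bytes in the longer file count as one final difference.
        ranges.append((start if start is not None else overlap, longest - 1))
    elif start is not None:
        ranges.append((start, overlap - 1))
    return ranges
```

Called as diff_ranges('file1.exe', 'file2.exe'), it returns a list that can be printed or written to data.txt. Because the files are read sequentially, no seek() calls are needed, and the == comparison on bytes read in 'rb' mode works the same for binary and ASCII content.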