The fastest way to get the Hamming distance for integer arrays

Let a and b be two vectors of the same length containing 8-bit integers (0-255). I want to calculate the number of bits in which these vectors differ, i.e. the Hamming distance between the vectors formed by concatenating the binary representations of these numbers. For example:

    a = [127, 255]
    b = [127, 240]

Using the numpy library:

    np.bitwise_xor(a, b)  # Output: array([ 0, 15])

Now I need the binary representation of each element of the resulting array and a count of the 1s over all elements. The example above gives a Hamming distance of 0 + 4 = 4. Is there a quick and elegant solution for this in Python?
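For concreteness, here is the computation spelled out on the example above (a minimal sketch, not necessarily quick or elegant, using np.unpackbits, which requires uint8 input):

    import numpy as np

    a = np.array([127, 255], dtype=np.uint8)
    b = np.array([127, 240], dtype=np.uint8)

    # XOR marks the differing bits; unpackbits expands each byte into its 8 bits.
    print(np.unpackbits(np.bitwise_xor(a, b)).sum())  # 4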

+5
4 answers

Approach #1: We could expand each number into its eight binary bits and count the number of bits that differ, like so:

    def hamming_distance(a, b):
        r = (1 << np.arange(8))[:, None]
        return np.count_nonzero((a & r) != (b & r))

Sample run -

    In [144]: a = [127,255]
         ...: b = [127,240]

    In [145]: hamming_distance(a, b)
    Out[145]: 4
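The trick here is the [:, None] reshape: r becomes a column of the eight bit masks, so a & r broadcasts to an (8, n) array with one row per bit position and one column per element. A small illustration (an illustrative sketch, not from the original answer):

    import numpy as np

    r = (1 << np.arange(8))[:, None]   # shape (8, 1): the eight bit masks as a column
    print(r.ravel())                   # [  1   2   4   8  16  32  64 128]

    a = np.array([127, 255])
    print((a & r).shape)               # (8, 2): one row per bit, one column per element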

Approach #2: Using bitwise XOR, we can find the number of differing binary bits between a and b -

    def hamming_distance_v2(a, b):
        r = (1 << np.arange(8))[:, None]
        return np.count_nonzero((np.bitwise_xor(a, b) & r) != 0)
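As an aside (not part of either approach above): newer versions of Python and numpy ship built-in population counts (int.bit_count() since Python 3.10, np.bitwise_count() since numpy 2.0) that make the same computation even more direct. A sketch, assuming those versions are available:

    import numpy as np

    a = np.array([127, 255], dtype=np.uint8)
    b = np.array([127, 240], dtype=np.uint8)

    # numpy >= 2.0: vectorized popcount of each element.
    print(np.bitwise_count(a ^ b).sum())            # 4

    # Python >= 3.10: the same with the plain-int method.
    print(sum(int(x).bit_count() for x in a ^ b))   # 4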
+6

If you intend to call the distance function repeatedly during one run of your program, you can gain some speed by using a precomputed table of bit counts. Here is (yet another) version of the Hamming distance function:

    # _nbits[k] is the number of 1s in the binary representation of k, for 0 <= k < 256.
    _nbits = np.array(
        [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
         4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8], dtype=np.uint8)

    def hamming_distance1(a, b):
        c = np.bitwise_xor(a, b)
        n = _nbits[c].sum()
        return n
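The hard-coded table need not be typed in by hand; an equivalent (and arguably less error-prone) way to build it at import time would be:

    import numpy as np

    # Equivalent construction of the 256-entry popcount table at import time.
    _nbits = np.array([bin(k).count("1") for k in range(256)], dtype=np.uint8)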

In the following, a and b are the Python lists of length 32 given in a comment to the question. divakar_hamming_distance() and divakar_hamming_distance_v2() are from @Divakar's answer.

First, the timings of @Divakar's functions:

    In [116]: %timeit divakar_hamming_distance(a, b)
    The slowest run took 5.57 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 11.3 µs per loop

    In [117]: %timeit divakar_hamming_distance_v2(a, b)
    The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 10.3 µs per loop

hamming_distance1(a, b) is a little faster:

    In [118]: %timeit hamming_distance1(a, b)
    The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 7.42 µs per loop

On my computer, initializing _nbits takes about 11 microseconds, so there is no point in using hamming_distance1 if you call the function only once. If you call it three or more times, there is a net performance gain.

If the inputs are already numpy arrays, all the functions are much faster:

    In [119]: aa = np.array(a)

    In [120]: bb = np.array(b)

    In [121]: %timeit divakar_hamming_distance_v2(aa, bb)
    The slowest run took 8.22 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 5.72 µs per loop

    In [122]: %timeit hamming_distance1(aa, bb)
    The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 2.77 µs per loop

Of course, if you always do the conversion immediately before computing the Hamming distance, the conversion time should be included in the total. But if you write the code that generates a and b to take advantage of numpy earlier, you may already have them as numpy arrays by the time you compute the Hamming distance.
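In other words, hoist the conversion out of any hot path so it is paid only once. A sketch of that pattern, reusing hamming_distance1 from above (the 1000-iteration loop is just a stand-in for repeated calls):

    import numpy as np

    a = [127, 255]
    b = [127, 240]

    aa = np.asarray(a, dtype=np.uint8)   # pay the conversion cost once...
    bb = np.asarray(b, dtype=np.uint8)

    for _ in range(1000):                # ...so repeated calls stay on the fast path
        d = hamming_distance1(aa, bb)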


(I also experimented a bit with a two-dimensional array of precomputed Hamming distances between 8-bit values, i.e. an array with shape (256, 256), but the initialization cost is higher and the performance gains are small.)
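For completeness, such a (256, 256) table can be built in one line from the _nbits table above; this is a reconstruction of the idea, not the author's actual code:

    import numpy as np

    # pair_dist[i, j] is the Hamming distance between byte values i and j.
    pair_dist = _nbits[np.arange(256)[:, None] ^ np.arange(256)]

    def hamming_distance2(a, b):
        # Fancy indexing looks up each pair's distance, then sums over elements.
        return pair_dist[a, b].sum()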

+4

Maybe not the most efficient way, but the easiest, IMO: convert each element of your output array to its binary string representation, count the '1' characters, and sum the counts...

    import numpy as np

    output = np.random.randint(0, 63, 10)  # e.g. the result of np.bitwise_xor(a, b)
    bit_counts = ['{:b}'.format(x).count('1') for x in output]
    hamming = sum(bit_counts)              # total Hamming distance
+1
    sum(bin(x).count("1") for x in np.bitwise_xor(a, b))
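Applied to the example from the question (a = [127, 255], b = [127, 240]), this one-liner returns 0 + 4 = 4.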
0
