Although this may not be possible without a loop in Python, it can be made very fast with numba and just-in-time compilation. I assumed that your inputs can easily be represented as Boolean arrays, which are simple to build from a binary file using struct. The method I implemented iterates over a few different objects, but these iterations were chosen carefully so that the compiler can optimize them and no work is done twice. The first iteration uses np.where to find the indices of all bits to be deleted; this function (among many others) is supported and optimized by the numba compiler. I then use this list of bit indices to build slice indices for the runs of bits to keep. The final loop copies these runs into an empty output array.
import numpy as np
from numba import jit
from time import time

def binary_mask(num, mask):
    num_nbits = num.shape[0]  # how many bits are in our big num
    mask_bits = np.where(mask)[0]  # which bits are we deleting
    mask_n_bits = mask_bits.shape[0]  # how many bits are we deleting
    start = np.empty(mask_n_bits + 1, dtype=np.int64)  # preallocate array for slice start indexes
    start[0] = 0  # first slice starts at 0
    start[1:] = mask_bits + 1  # subsequent slices start 1 after each True bit in mask
    end = np.empty(mask_n_bits + 1, dtype=np.int64)  # preallocate array for slice end indexes
    end[:mask_n_bits] = mask_bits  # each slice ends on (but does not include) True bits in the mask
    end[mask_n_bits] = num_nbits  # last slice goes all the way to the end
    out = np.empty(num_nbits - mask_n_bits, dtype=np.bool_)  # preallocate return array (same type as the declared b1[:] return)
    for i in range(mask_n_bits + 1):  # for each slice
        a = start[i]  # use local variables to reduce number of lookups
        b = end[i]
        c = a - i  # i bits were deleted before this slice, so shift left by i
        d = b - i
        out[c:d] = num[a:b]  # copy slices
    return out

jit_binary_mask = jit("b1[:](b1[:], b1[:])")(binary_mask)  # decorator without syntax sugar

###################### Benchmark ########################
bignum = np.random.randint(0, 2, 1000000, dtype=bool)  # 1 million random bits
bigmask = np.random.randint(0, 10, 1000000, dtype=np.uint8) == 9  # delete about 1 in 10 bits

t = time()
for _ in range(10):  # 10 cycles of just numpy implementation
    out = binary_mask(bignum, bigmask)
print(f"non-jit: {time()-t} seconds")

t = time()
out = jit_binary_mask(bignum, bigmask)  # once ahead of time to compile
compile_and_run = time() - t

t = time()
for _ in range(10):  # 10 cycles of compiled numpy implementation
    out = jit_binary_mask(bignum, bigmask)
jit_runtime = time() - t
print(f"jit: {jit_runtime} seconds")
print(f"estimated compile_time: {compile_and_run - jit_runtime/10}")
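To make the slice bookkeeping easier to follow, here is a small worked example in plain NumPy (no numba needed). It is only a sketch: `slice_bounds` is a hypothetical helper name, it builds the start/end arrays with np.concatenate rather than preallocating them, and it uses ordinary Boolean indexing (`num[~mask]`) as a reference result to check against.

```python
import numpy as np

def slice_bounds(mask):
    """Return (start, end) index arrays for the runs of bits to keep."""
    mask_bits = np.where(mask)[0]                       # indices of bits to delete
    start = np.concatenate(([0], mask_bits + 1))        # each run starts right after a deleted bit
    end = np.concatenate((mask_bits, [mask.shape[0]]))  # each run ends at (excluding) a deleted bit
    return start, end

num = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)
mask = np.array([0, 0, 1, 0, 0, 1, 0, 0], dtype=bool)  # delete bits 2 and 5

start, end = slice_bounds(mask)
# start -> [0, 3, 6], end -> [2, 5, 8]

# Copy the kept runs back to back, which is what the copy loop does.
kept = np.concatenate([num[a:b] for a, b in zip(start, end)])

# Reference result: Boolean indexing keeps every bit where mask is False.
reference = num[~mask]
print(np.array_equal(kept, reference))  # True
```

The same comparison against `num[~mask]` is a quick way to sanity-check the output of binary_mask on random inputs before trusting the timings.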
The benchmark at the end of the script runs both the non-compiled and the compiled version 10 times each on a 1,000,000-element Boolean array. On my laptop, the output is:
non-jit: 1.865583896636963 seconds
jit: 0.06370806694030762 seconds
estimated compile_time: 0.1652850866317749
As you can see, even for such a simple algorithm, compilation gives a very significant performance gain (in my case, about 20-30 times faster).
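On the input side, the following is a minimal sketch of one way to get a Boolean array out of raw bytes. The function name `bytes_to_bits` is hypothetical, and np.unpackbits is my own choice for the bit expansion (the struct step only turns the bytes into integers). It assumes you want the most significant bit of each byte first, which is the default for np.unpackbits.

```python
import struct
import numpy as np

def bytes_to_bits(raw):
    """Expand a bytes object into a 1-D Boolean array, MSB of each byte first."""
    values = struct.unpack(f"{len(raw)}B", raw)   # one unsigned 8-bit integer per byte
    as_bytes = np.array(values, dtype=np.uint8)
    return np.unpackbits(as_bytes).astype(np.bool_)

bits = bytes_to_bits(b"\xa5")  # 0xA5 is 10100101 in binary
print(bits.tolist())
# [True, False, True, False, False, True, False, True]
```

For a real file you would read the contents with open(path, "rb").read() and pass the resulting bytes in; both the number and the mask can be prepared this way.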
Aaron