Python bitmask (variable length)

to solve one research problem, we need to organize a search for a bit mask in python. As input, we have raw data (we represent it as a sequence of bits). The size is about 1.5 GB. as a conclusion, we should get the number of occurrences of specific bitmasks. Let me give an example to describe the situation.

input:    sequence of bits, a bitmask to search(mask length: 12bits)

The first idea (inefficient) is to use XOR as follows:

1step: from input we take 12 first bits(position 0 to 11) and make XOR with mask 
2step: from input we take bits from 1 to 12 position and XOR with mask ...

take the first two steps:

input sequence 100100011110101010110110011010100101010110101010
mask to search: 100100011110
step 1: take first 12 bits from input: 100100011110 and XOR it with mask.
step 2: teke bits from 1 to 12position: 001000111101 and XOR it with mask.
...

The problem is this: how to organize the reception of bits from the input? we can take the first 12 bits, but how can we take a bit from 1 to 12 positions that we need to continue the next iteration?

BitString python, , , . . 12 256. ? python

+3
3

- "" ​​ , , , . KMP, , .

O(n*m) O(n+m).

+4

8 , , ( , , , .)

sequence = <list of 8-bit integers>
mask = [0b10010001, 0b01101101]
matches = my_substring_search(sequence, mask)

8 , , 8 , . .

sequence = <list of 8-bit integers>
mask_a = [0b10010001]
mask_b = 0b01100000
mask_b_pattern = 0b11110000   # relevant bits of mask_b
matches = my_substring_search(sequence, mask_a)

for match in matches:
    if (sequence[match+len(mask_a)] & mask_b_pattern) == mask_b:
        valid_match = True  # or something more useful...

sequence - 4096 , . , sequence 4096+ceil(mask_bits/8.0), 4096 , .


:

class Mask(object):
    def __init__(self, source, source_mask):
        self._masks = list(self._generate_masks(source, source_mask))

    def match(self, buffer, i, j):
        return any(m.match(buffer, i, j) for m in self._masks)

    class MaskBits(object):
        def __init__(self, pre, pre_mask, match_bytes, post, post_mask):
            self.match_bytes = match_bytes
            self.pre, self.pre_mask = pre, pre_mask
            self.post, self.post_mask = post, post_mask

        def __repr__(self):
            return '(%02x %02x) (%s) (%02x %02x)' % (self.pre, self.pre_mask,
                ', '.join('%02x' % m for m in self.match_bytes),
                self.post, self.post_mask)

        def match(self, buffer, i, j):
            return (buffer[i:j] == self.match_bytes and
                    buffer[i-1] & self.pre_mask == self.pre and
                    buffer[j] & self.post_mask == self.post)

    def _generate_masks(self, src, src_mask):
        pre_mask = 0
        pre = 0
        post_mask = 0
        post = 0
        while pre_mask != 0xFF:
            src_bytes = []
            for i in (24, 16, 8, 0):
                if (src_mask >> i) & 0xFF == 0xFF:
                    src_bytes.append((src >> i) & 0xFF)
                else:
                    post_mask = (src_mask >> i) & 0xFF
                    post = (src >> i) & 0xFF
                    break
            yield self.MaskBits(pre, pre_mask, src_bytes, post, post_mask)
            pre += pre
            pre_mask += pre_mask
            if src & 0x80000000: pre |= 0x00000001
            pre_mask |= 0x00000001
            src = (src & 0x7FFFFFFF) * 2
            src_mask = (src_mask & 0x7FFFFFFF) * 2

, . Mask , , ( ) 32 :

src = 0b11101011011011010101001010100000
src_mask = 0b11111111111111111111111111100000

:

buffer_1 = [0x7e, 0xb6, 0xd5, 0x2b, 0x88]

Mask :

>> m = Mask(src, src_mask)
>> m._masks
[(00 00) (eb, 6d, 52) (a0 e0),
 (01 01) (d6, da, a5) (40 c0),
 (03 03) (ad, b5, 4a) (80 80),
 (07 07) (5b, 6a, 95) (00 00),
 (0e 0f) (b6, d5) (2a fe),
 (1d 1f) (6d, aa) (54 fc),
 (3a 3f) (db, 54) (a8 f8),
 (75 7f) (b6, a9) (50 f0)]

- ( , m._masks[i].match_bytes). , m.match(buffer, i, j), i - , j - ( , buffer[i:j] == match_bytes).

buffer , 5, , _masks[4].match_bytes buffer[1:3]. :

>> m.match(buffer, 1, 3)
True

( , , , . - - , , !)

+2

, . , , "" , 50% ( ).

bitstring ( ), . , , - , , ! :

match "0" "1", Python find . , .

masks = ['0000101010100101', '010100011110110101101', '01010101101']
byte_data_chunk = bytearray('blahblahblah')
# convert to a string with one character per bit
# uses lookup table not given here!
s = ''.join(BYTE_TO_BITS[x] for x in byte_data_chunk)
for mask in masks:
    p = s.find(mask)
    # etc.

, find, , . bitstring, , .

+1

Source: https://habr.com/ru/post/1790334/


All Articles