Binary sequence detector

Does anyone know an efficient way to detect a 37-bit sequence in a block of binary data? Of course I can do a brute-force comparison with a sliding window (compare the 37 bits starting at bit index 0, then increment the index and loop until I find a match), but is there a better way? Maybe some kind of hash search that returns the probability that the sequence lies within the block? Or am I just pulling that out of my butt? Anyway, I'm going to start with the brute-force search, but I was curious whether there is something more optimal. This is in C, by the way.
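For reference, the brute-force window described above might look roughly like the sketch below. It is only an illustration: the function name find_bits_naive, the MSB-first bit numbering, and the convention of passing the pattern in the low bits of a uint64_t are all assumptions, not something stated in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the bit offset (counting from the most
   significant bit of data[0]) of the first occurrence of the low `nbits`
   bits of `pattern`, or -1 if it does not occur. Assumes 1 <= nbits <= 64. */
long find_bits_naive(const unsigned char *data, size_t len,
                     uint64_t pattern, unsigned nbits)
{
    uint64_t mask = (nbits == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << nbits) - 1);
    uint64_t window = 0;          /* the most recent nbits bits of input */
    size_t total = len * 8;

    pattern &= mask;
    for (size_t i = 0; i < total; i++) {
        /* Shift the next input bit into the window, MSB first. */
        unsigned bit = (data[i / 8] >> (7 - i % 8)) & 1u;
        window = ((window << 1) | bit) & mask;

        /* Only compare once a full window of nbits has been read. */
        if (i + 1 >= nbits && window == pattern)
            return (long)(i + 1 - nbits);
    }
    return -1;
}
```

Keeping the window in a shift register means each input bit is read once and each position costs one shift, one mask and one compare, so this is already linear in the number of bits.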

+3
6 answers

Interesting question. I'm assuming that your 37-bit sequence can start at any bit position within a byte. Let's say your sequence is represented like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@

If we use a byte-aligned algorithm, we can see these 32-bit sequences of bytes:

BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A]
CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.]
DEFGHIJKLMNOPQRSTUVWXYZ012345678
EFGHIJKLMNOPQRSTUVWXYZ0123456789
FGHIJKLMNOPQRSTUVWXYZ0123456789@
GHIJKLMNOPQRSTUVWXYZ0123456789@x
HIJKLMNOPQRSTUVWXYZ0123456789@xx
IJKLMNOPQRSTUVWXYZ0123456789@xxx

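With a byte-aligned search we can look for these words and, when one matches, check the neighboring bytes for the remaining bits of the 37-bit sequence.

Something like this: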

unsigned char *p = ...; // input data
size_t n = ...;  // bytes available
size_t bitpos;

--n; p++;
bitpos = 0;

while (n--) {  // NOTE: a real loop must also stop while p[4] is still in bounds
  uint32_t word = *(uint32_t*)p; // nonportable, sorry.
  bitpos += 8; // compiler should be able to optimise this variable out completely

  // Pseudocode: 789@, 89@, A and AB stand for those pattern bits as they
  // appear in the masked byte.
  if (word == w_A) {
    if (((p[4] & 0xF0) == 789@) && ((p[-1] & 1) == A)) {
      // we found the data starting at the 8th bit of p-1
      found_at(bitpos-1);
    }
  } else if (word == w_B) {
    if (((p[4] & 0xE0) == 89@) && ((p[-1] & 3) == AB)) {
      // we found the data starting at the 7th bit of p-1
      found_at(bitpos-2);
    }
  } else if (word == w_C) {
     ...
  }
  ...
  p++; // advance to the next byte
}

Clearly there are some issues with this strategy. Firstly, it may evaluate p[-1] the first time round the loop, but that is easy to fix. Secondly, it pulls a word from odd addresses, which does not work on some CPUs, for example SPARC and 68k. But doing so is an easy way to roll 4 comparisons into one.
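One way to keep the byte-aligned idea while avoiding the unaligned, endian-dependent word load is to assemble a 6-byte window one byte at a time and compare it against the pattern pre-shifted to each of the 8 bit positions. The sketch below is only an illustration of that variant, not the code from this answer; the name find37_bytewise, the MSB-first bit numbering and the pattern-in-low-37-bits convention are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define PAT_BITS 37u

/* Hypothetical helper: returns the bit offset (MSB-first) of the first
   occurrence of the low 37 bits of `pattern`, or -1 if there is none. */
long find37_bytewise(const unsigned char *data, size_t len, uint64_t pattern)
{
    uint64_t pat[8], mask[8];
    const uint64_t ones = (UINT64_C(1) << PAT_BITS) - 1;

    /* Precompute the pattern and its mask at each of the 8 bit shifts,
       positioned inside a 48-bit (6-byte) big-endian window. */
    for (unsigned s = 0; s < 8; s++) {
        unsigned sh = 48u - PAT_BITS - s;   /* 11 down to 4 */
        mask[s] = ones << sh;
        pat[s]  = (pattern & ones) << sh;
    }

    for (size_t i = 0; i < len; i++) {
        /* Assemble 6 bytes one at a time: no unaligned load, no endian
           dependence. Bytes past the end of the buffer read as zero. */
        uint64_t w = 0;
        for (size_t k = 0; k < 6; k++)
            w = (w << 8) | (i + k < len ? data[i + k] : 0u);

        size_t avail = (len - i) * 8;       /* real bits from byte i onward */
        for (unsigned s = 0; s < 8; s++) {
            /* Reject shifts that would rely on the zero padding. */
            if (s + PAT_BITS <= avail && (w & mask[s]) == pat[s])
                return (long)(i * 8 + s);
        }
    }
    return -1;
}
```

This does 8 masked compares per input byte instead of one compare per input bit, at the cost of rebuilding the window for each byte; a rolling window would avoid that rebuild.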

kek444's suggestion would let you use an algorithm such as KMP to skip ahead in the data stream. However, the maximum skip is small, so although the Turbo Boyer-Moore algorithm might reduce the number of byte comparisons by 4 or so, that may not be much of a win if a byte comparison costs about the same as a word comparison.
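To make the KMP option concrete, here is a minimal sketch of KMP run over individual bits. It is an illustration under assumptions (the name find37_kmp, MSB-first bit numbering, pattern in the low 37 bits of a uint64_t), not code from any answer here. Note that plain KMP never re-reads an input bit, but it still visits every bit once; only Boyer-Moore-style searches actually skip input.

```c
#include <stddef.h>
#include <stdint.h>

#define PAT_BITS 37

/* Bit j of the pattern, counting from its first (most significant) bit. */
static unsigned pat_bit(uint64_t pattern, int j)
{
    return (unsigned)(pattern >> (PAT_BITS - 1 - j)) & 1u;
}

/* Hypothetical helper: bit offset (MSB-first) of the first match, or -1. */
long find37_kmp(const unsigned char *data, size_t len, uint64_t pattern)
{
    /* fail[j]: length of the longest proper prefix of the pattern that is
       also a suffix of its first j+1 bits. */
    int fail[PAT_BITS];
    int k = 0;

    fail[0] = 0;
    for (int j = 1; j < PAT_BITS; j++) {
        while (k > 0 && pat_bit(pattern, j) != pat_bit(pattern, k))
            k = fail[k - 1];
        if (pat_bit(pattern, j) == pat_bit(pattern, k))
            k++;
        fail[j] = k;
    }

    k = 0;                          /* pattern bits matched so far */
    for (size_t i = 0; i < len * 8; i++) {
        unsigned bit = (data[i / 8] >> (7 - i % 8)) & 1u;
        /* On a mismatch, fall back to the longest prefix that still fits. */
        while (k > 0 && bit != pat_bit(pattern, k))
            k = fail[k - 1];
        if (bit == pat_bit(pattern, k))
            k++;
        if (k == PAT_BITS)
            return (long)(i + 1 - PAT_BITS);
    }
    return -1;
}
```

Because the alphabet is just {0,1}, the per-bit bookkeeping here is likely to cost more than the shift-register compare of a brute-force window, which is consistent with the doubt expressed above about how much such algorithms gain.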

+4

, {0,1} , - .

+6

N , , , , , ( , ).

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
<--            N bits           -->
<--   'ugly' M bits    -->|<-- continue here

.

, , DFA, . .

+1

, , , , . , xor , 0, . . , 2 . 2 , . 17 , . ( , , )

#include <stdbool.h>  /* bool, true, false */

/* Data is passed in; on a match, *offset receives the number of bits by
   which the pattern is offset from the first bit of data (-1 if no match).
   Returns true if a match was found.
*/
bool checkData(char* data, int* offset)
{
    /* Mask to mask off the first bits not being used or examined */
    static char firstMask[8] = { 0xFF, 0x7F, 0x3F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };

    /* Mask to mask off the end bits not used or examined */
    static char endMask[8] = { 0x80, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF };

    /* Pattern being searched for: each row is the pattern shifted right by
       that many bits, and the columns hold the bytes to be compared. For
       example, index 0 is a shift of 0 bits and index 7 is a shift of seven bits.
       NOTE: Bits not being used are set to zero.  
    */
    static char pattern[8][3] = { { 0xFF, 0xFF, 0x80 },  /* Original pattern */
                                  { 0x7F, 0xFF, 0xC0 },  /* Shifted by one */
                                  { 0x3F, 0xFF, 0xE0 },  /* Shifted by two */
                                  { 0x1F, 0xFF, 0xF0 },
                                  { 0x0F, 0xFF, 0xF8 },
                                  { 0x07, 0xFF, 0xFC },
                                  { 0x03, 0xFF, 0xFE },
                                  { 0x01, 0xFF, 0xFF }}; /* shifted by seven */

    /* outer loop control variable */
    int lcv;

    /* inner loop control variable */
    int lcv2;

    /* holds the result of each masked comparison */
    char value;

    /* if there is no match, pass back a negative number to indicate no match */
    *offset = -1;

    /* Loop through the shifted patterns looking for a match */
    for ( lcv = 0; lcv < 8 ; lcv++ ) 
    {
        /* check the first part of the pattern:
           mask off the part that is not to be checked and xor the rest
           with the first part of the pattern */

        value = (firstMask[lcv] & *data) ^ pattern[lcv][0];
        /* if value is not zero, no match, so go to the next shift */
        if ( 0 != value ) 
        {
            continue;
        }

        /* loop through the middle of the pattern to make sure it matches;
           if it does not, break out of the loop
           NOTE:  Adjust the condition to match 1 less than the number 
                  of 8 bit items you are comparing
        */
        for ( lcv2 = 1; lcv2 < 2; lcv2++)
        {
            if ( 0 != (*(data+lcv2)^pattern[lcv][lcv2]))
            {
                break;
            }
        }

        /* if the end of the loop was not reached, the pattern 
           does not match, so continue to the next one
           NOTE: See note above about the condition 
        */   
        if ( 2 != lcv2)
        {
          continue;
        }

        /* Check the end of the pattern to see if there is a match after masking
           off the bits which are not being checked.
        */  
        value = (*(data + lcv2) & endMask[lcv]) ^ pattern[lcv][lcv2];

        /* if value is zero, every part matched: stop here so that lcv
           keeps the shift at which the match was found */
        if ( 0 == value ) 
        {
          break;
        }
    }
    /* If the end of the loop was not reached, set the offset as it 
       is the number of bits the pattern is offset in the byte and 
       return true
    */
    if ( lcv < 8 ) 
    {
        *offset = lcv ;
        return true;
    }
    /* No match was found */
    return false;
}

. , .

, , .

, 37 .

0

B , , , 37- .

  • , .
  • 0, 0 .
  • 1..7, , , .
  • , , 8 , . , , 29, .

, , , . 256 8 , 256 , , 0s . 2 3 O (1), O (8), .

, , , , 29 ( 8..36). O (1) .

; , .

0

: bitstring - . , Aho-Corasick, , , , .

( 8 , , . , 1024 .)

0

Source: https://habr.com/ru/post/1714346/

