C Tokenizer - How does it work?

How does it work?

I know how to use it. You pass in:

  • start: the source string (for example, "Clause 1, Clause 2, Clause 3")
  • delim: the delimiter string (for example, ",")
  • tok: pointer to the string that will hold the token
  • nextpos (optional): pointer to the position in the source string where the next token begins
  • sdelim (optional): pointer to the character that will hold the delimiter preceding the token
  • edelim (optional): pointer to the character that will hold the delimiter following the token

The code:

#include <stdlib.h>
#include <string.h>

int token(char* start, char* delim, char** tok, char** nextpos, char* sdelim, char* edelim) {
    // Find beginning:
    int len = 0;
    char *scanner;
    int dictionary[8];
    int ptr;

    for(ptr = 0; ptr < 8; ptr++) {
        dictionary[ptr] = 0;
    }

    for(; *delim; delim++) {
        dictionary[*delim / 32] |= 1 << *delim % 32;
    }

    if(sdelim) {
        *sdelim = 0;
    }

    for(; *start; start++) {
        if(!(dictionary[*start / 32] & 1 << *start % 32)) {
            break;
        }
        if(sdelim) {
            *sdelim = *start;
        }
    }

    if(*start == 0) {
        if(nextpos != NULL) {
            *nextpos = start;
        }
        *tok = NULL;
        return 0;
    }

    for(scanner = start; *scanner; scanner++) {
        if(dictionary[*scanner / 32] & 1 << *scanner % 32) {
            break;
        }
        len++;
    }

    if(edelim) {
        *edelim = *scanner;
    }

    if(nextpos != NULL) {
        *nextpos = scanner;
    }

    *tok = (char*)malloc(sizeof(char) * (len + 1));

    if(*tok == NULL) {
        return 0;
    }

    memcpy(*tok, start, len);
    *(*tok + len) = 0;


    return len + 1;
}

I get most of it, except:

dictionary[*delim / 32] |= 1 << *delim % 32;

and

dictionary[*start / 32] & 1 << *start % 32

What is this magic?

+3
3 answers

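The dictionary is a bit table: 8 ints of 32 bits each, 8 x 32 = 256 bits, one bit for every possible character value.

dictionary[*delim / 32] |= 1 << *delim % 32;

sets the bit that corresponds to the character *delim, and

dictionary[*start / 32] & 1 << *start % 32

tests whether the bit that corresponds to the character *start is set.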
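To make the two operations easier to see, here is a minimal sketch that wraps them in helper functions. The names charset_add and charset_has are hypothetical (the original code inlines both expressions), and the sketch assumes unsigned int is at least 32 bits wide. It also casts to unsigned char, because a plain char may be signed, in which case characters above 127 would give a negative index in the original.

```c
/* Hypothetical helper names; the tokenizer in the question inlines these expressions. */

/* Mark character c as present in the 256-bit table. */
static void charset_add(unsigned int table[8], char c) {
    unsigned char uc = (unsigned char)c;      /* keeps the index in 0..255 */
    table[uc / 32u] |= 1u << (uc % 32u);      /* pick the int, then set one bit in it */
}

/* Return nonzero if character c was previously added. */
static int charset_has(const unsigned int table[8], char c) {
    unsigned char uc = (unsigned char)c;
    return (int)((table[uc / 32u] >> (uc % 32u)) & 1u);
}
```

Starting from a zeroed table, calling charset_add for each character of delim reproduces the first loop of the tokenizer, and charset_has is the test used in the two scanning loops.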

+1

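A char is 8 bits (sizeof(char) == 1 byte), so it can hold 256 distinct values.

The dictionary is an array of 8 ints (int dictionary[8]), and each int is 32 bits (assuming sizeof(int) is >= 4 bytes), which gives 32 * 8 = 256 bits.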

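So the dictionary is a 256-bit array of flags. A flag is set for every character in delim (dictionary[*delim / 32] |= 1 << *delim % 32;). *delim / 32 is the ASCII value of the character divided by 32. Since the ASCII value ranges from 0 to 255, the division gives an index from 0 to 7 into the array. The remainder, *delim % 32, is the position of the bit within that int, and the shift produces a mask with only that bit set.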
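Part of what makes the expressions look cryptic is operator precedence: % binds tighter than <<, which in turn binds tighter than &. The sketch below spells out the grouping with explicit parentheses. The function names are hypothetical, and it assumes a 32-bit unsigned int.

```c
/* Hypothetical helpers that spell out how the original expressions group. */

/* Which of the 8 ints holds the flag for c: values 0..255 map to 0..7. */
static unsigned word_index(unsigned char c) {
    return c / 32u;
}

/* Mask selecting the flag for c inside that int: only bit (c % 32) is set. */
static unsigned bit_mask(unsigned char c) {
    return 1u << (c % 32u);    /* same grouping as 1 << c % 32 */
}
```

With these, the test in the scanning loops reads as dictionary[word_index(c)] & bit_mask(c).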

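As a result, a flag in the 256-bit array is true exactly when the character with the corresponding ASCII value is a delimiter.

Then, for each character of the input, the code checks whether its flag is set in the 256-bit array (dictionary[*start / 32] & 1 << *start % 32).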
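It may help to look at an unpacked equivalent of the same idea: one byte per character value instead of one bit. The sketch below is not the code from the question; build_flags and next_token are hypothetical names, and the table takes 256 bytes rather than 32, but the lookups play the same role as the bit tests.

```c
#include <stddef.h>

/* Unpacked stand-in for the 256-bit dictionary: flags[c] is 1 if c is a delimiter. */
static void build_flags(unsigned char flags[256], const char *delim) {
    for (int i = 0; i < 256; i++) {
        flags[i] = 0;
    }
    for (; *delim; delim++) {
        flags[(unsigned char)*delim] = 1;
    }
}

/* Skip leading delimiters, store the token start in *begin,
   and return the number of characters up to the next delimiter or NUL. */
static size_t next_token(const unsigned char flags[256], const char *s,
                         const char **begin) {
    while (*s && flags[(unsigned char)*s]) {
        s++;
    }
    *begin = s;
    size_t len = 0;
    while (s[len] && !flags[(unsigned char)s[len]]) {
        len++;
    }
    return len;
}
```

The two while loops correspond to the two scanning loops in the question; the bit table only packs the same flags eight times more densely.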

+4

OK, so if we pass the string "," as the delimiter, then dictionary[*delim / 32] |= 1 << *delim % 32 sets dictionary[1] = 4096. The expression dictionary[*start / 32] & 1 << *start % 32 just checks the bit for the corresponding character.
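The arithmetic behind that number: ',' is 44 in ASCII, 44 / 32 = 1 selects dictionary[1], 44 % 32 = 12 selects bit 12, and 1 << 12 = 4096. A small sketch to check it follows; build_dictionary is a hypothetical name for the first two loops of the tokenizer, written with unsigned types and assuming a 32-bit unsigned int.

```c
/* Hypothetical helper: fills the table the way the first two loops of the
   tokenizer do, but with unsigned types so the index and shift are well defined. */
static void build_dictionary(unsigned dictionary[8], const char *delim) {
    for (int i = 0; i < 8; i++) {
        dictionary[i] = 0;
    }
    for (; *delim; delim++) {
        unsigned char c = (unsigned char)*delim;
        dictionary[c / 32u] |= 1u << (c % 32u);
    }
}
```

After build_dictionary(d, ","), d[1] holds 4096 and the other seven entries stay 0.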

What puzzles me is why they do not just compare the characters directly.
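One plausible reason, offered as a guess rather than the author's stated intent: a direct comparison has to walk the delimiter string for every input character, while the bit table answers in constant time per character. For a short delimiter string the difference is probably negligible. For comparison, here is a sketch of the direct approach using the standard strspn and strcspn; the function name next_token_direct is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: locate the next token without building a lookup table.
   strspn skips the leading delimiters, strcspn measures the token itself. */
static size_t next_token_direct(const char *s, const char *delim,
                                const char **begin) {
    s += strspn(s, delim);      /* length of the leading run of delimiters */
    *begin = s;
    return strcspn(s, delim);   /* length up to the next delimiter or NUL */
}
```

Calling it repeatedly, each time starting at *begin plus the returned length, walks through the tokens until it returns 0 at the end of the string.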

0

Source: https://habr.com/ru/post/1757362/

