This is a noob question from someone who hasn't written a parser / lexer yet.
I am writing a CSS tokenizer / parser in PHP (please do not reply with "OMG, why in PHP?"). The syntax is written down neatly by the W3C here (CSS2.1) and here (CSS3, draft).
It is a list of 21 possible tokens, all but two of which cannot be represented as static strings.
My current approach is to loop through an array containing the 21 patterns over and over, do an if (preg_match()) on each, and reduce the source string match by match. In principle, this works really well. However, for a string of 1000 lines of CSS it takes between 2 and 8 seconds, which is too much for my project.
Now I'm banging my head over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless, are there any obvious D'Oh! mistakes that I fell into?
I made some optimizations, such as checking for '@', '#' or '"' as the first char of the remaining string and then applying only the relevant regexp, but this did not bring any big performance improvement.
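For illustration, a first-character dispatch of that kind could look roughly like the sketch below. The function names `candidateTokens` and `tokenize`, the `HASH` entry, and the simplified patterns are assumptions made for this example; the real regexps follow the W3C token grammar and are more involved (for instance, the string pattern here ignores escapes).

```php
<?php
// Hypothetical, heavily simplified patterns (NOT the real W3C regexps).
$TOKENS = array(
    'IDENT'     => '-?[a-zA-Z_][a-zA-Z0-9_-]*',
    'ATKEYWORD' => '@-?[a-zA-Z_][a-zA-Z0-9_-]*',
    'HASH'      => '#[a-zA-Z0-9_-]+',
    'STRING'    => '"[^"]*"|\'[^\']*\'',
);

// Return only the patterns that can start with the first character of
// $string; fall back to the full list for any other character.
function candidateTokens($string, array $tokens)
{
    if ($string === '') {
        return array();
    }
    $byFirstChar = array(
        '@'  => array('ATKEYWORD'),
        '#'  => array('HASH'),
        '"'  => array('STRING'),
        '\'' => array('STRING'),
    );
    $first = $string[0];
    if (isset($byFirstChar[$first])) {
        return array_intersect_key($tokens, array_flip($byFirstChar[$first]));
    }
    return $tokens;
}

// Same reduce-the-string loop as in the question, but only over the
// candidate patterns for the current first character.
function tokenize($string, array $tokens)
{
    $stream = array();
    while (($string = ltrim($string, " \t\r\n\f")) !== '') {
        foreach (candidateTokens($string, $tokens) as $type => $pattern) {
            // (?: ... ) keeps the ^ anchor applied to every alternative
            if (preg_match('&^(?:' . $pattern . ')&Su', $string, $m)) {
                $stream[] = array($type, $m[0]);
                $string = substr($string, strlen($m[0]));
                continue 2;
            }
        }
        // no pattern matched: treat as a syntax error in this sketch
        throw new RuntimeException('Unexpected input: ' . substr($string, 0, 20));
    }
    return $stream;
}
```

Calling `tokenize('@media #main "x" foo', $TOKENS)` tries a single pattern for each of the first three tokens. Input starting with any other character still falls back to trying every pattern in order.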
My code (snippet):
    $TOKENS = array(
        'IDENT'     => '...regexp...',
        'ATKEYWORD' => '@...regexp...',
        'String'    => '"...regexp..."|\'...regexp...\'',
        //...
    );

    $string = '...CSS source string...';
    $stream = array();

    // we reduce $string token by token
    while ($string != '') {
        // unconsumed whitespace at the start is insignificant,
        // but doing a trim reduces exec time by 25%
        $string = ltrim($string, " \t\r\n\f");
        $matches = array();
        // loop through all possible tokens
        foreach ($TOKENS as $t => $p) {
            // The '&' is used as delimiter, because it isn't used anywhere in
            // the token regexps
            if (preg_match('&^'.$p.'&Su', $string, $matches)) {
                $stream[] = array($t, $matches[0]);
                $string = substr($string, strlen($matches[0]));
                // Yay! We found one that matches!
                continue 2;
            }
        }
        // if we come here, we have a syntax error and handle it somehow
    }

    // result: an array $stream consisting of arrays with
    //   0 => type of token
    //   1 => token content