Tokenizing CSS Performance in PHP

Question


This is a noob question from someone who hasn't written a parser / lexer yet.

I am writing a CSS tokenizer / parser in PHP (please do not reply with "OMG, why in PHP?"). The syntax is written down neatly by the W3C here (CSS2.1) and here (CSS3, draft).

It is a list of 21 possible tokens, all but two of which cannot be represented as static strings.

My current approach is to loop through an array containing the 21 patterns over and over, run if (preg_match()) on each, and shorten the source string match by match. In principle, this works very well. However, for a string of 1000 lines of CSS it takes 2 to 8 seconds, which is too much for my project.

Now I'm banging my head over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nevertheless, is there some obvious D'Oh! that I fell for?

I made some optimizations, such as checking for "@", "#" or '"' as the first char of the remaining string and then applying only the relevant regexp, but this did not bring any big performance improvement.

My code (snippet):

    $TOKENS = array(
        'IDENT'     => '...regexp...',
        'ATKEYWORD' => '@...regexp...',
        'String'    => '"...regexp..."|\'...regexp...\'',
        //...
    );

    $string = '...CSS source string...';
    $stream = array();

    // we reduce $string token by token
    while ($string != '') {
        // unconsumed whitespace at the start is insignificant,
        // but doing a trim reduces exec time by 25%
        $string = ltrim($string, " \t\r\n\f");
        $matches = array();
        // loop through all possible tokens
        foreach ($TOKENS as $t => $p) {
            // The '&' is used as delimiter, because it isn't used anywhere in
            // the token regexps
            if (preg_match('&^'.$p.'&Su', $string, $matches)) {
                $stream[] = array($t, $matches[0]);
                $string = substr($string, strlen($matches[0]));
                // Yay! We found one that matches!
                continue 2;
            }
        }
        // if we come here, we have a syntax error and handle it somehow
    }

    // result: an array $stream consisting of arrays with
    //   0 => type of token
    //   1 => token content

+4

performance php parsing token lexer

Boldewyn Apr 9 '10 at 18:53


5 answers

The first thing I would do is get rid of preg_match(). Basic string functions like strpos() are much faster, but I don't think you even need that. It looks like you are using preg_match() to look for a specific token at the start of the string, and then cutting the length of that match off the front with a substring. You can easily accomplish this with a simple substr(), for example:

    foreach ($TOKENS as $t => $p) {
        $len = strlen($p); // this could be pre-stored in $TOKENS
        $front = substr($string, 0, $len);
        if ($front == $p) {
            // store the matched token text, not the whole remaining string
            $stream[] = array($t, $front);
            $string = substr($string, $len);
            // Yay! We found one that matches!
            continue 2;
        }
    }

You can further optimize this by pre-calculating the lengths of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could also reduce the number of substr() calls, since you could take a substr() of the current string just once for each token length and run through all tokens of that length before moving on to the next group of tokens.
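The grouping by length described above could look roughly like the sketch below. It only applies to tokens that are fixed strings, which the question says is true of very few CSS tokens; the four literals used here (INCLUDES, DASHMATCH, CDO, CDC from CSS2.1) and the helper name match_static_token() are assumptions made for illustration, and an integer offset is used instead of cutting the string.

```php
<?php
// Fixed-text tokens grouped by length, longest group first.
// The grouping layout and the function name are hypothetical.
$STATIC_TOKENS = array(
    4 => array('<!--' => 'CDO'),
    3 => array('-->'  => 'CDC'),
    2 => array('~='   => 'INCLUDES', '|=' => 'DASHMATCH'),
);

// Returns array(type, text) if a static token starts at $offset, else null.
function match_static_token($string, $offset, array $groups) {
    foreach ($groups as $len => $literals) {
        // one substr() call per token length, then a plain array lookup
        $front = substr($string, $offset, $len);
        if (is_string($front) && isset($literals[$front])) {
            return array($literals[$front], $front);
        }
    }
    return null;
}
```

With this layout the number of substr() calls per position equals the number of distinct token lengths, not the number of tokens.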

0

zombat Apr 9 '10 at 19:30


A (possibly) faster (but less memory friendly) approach would be to tokenize the entire stream at once using one large regular expression with alternatives for each token, e.g.

    preg_match_all('/
        (...string...)
        | (@ident)
        | (\#ident)   # under the x flag a bare "#" starts a comment, so escape it
        ...etc
    /x', $stream, $tokens);

    foreach($tokens as $token)...parse
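A fuller sketch of the single-regex idea follows. The token patterns are heavily simplified placeholders, not the W3C definitions, and the function name tokenize_css_sketch() is made up for this example. Each pattern becomes a named group, so the token type can be read off from whichever group is non-empty.

```php
<?php
// Hypothetical, simplified patterns; the real CSS token regexps are longer.
// Order matters: earlier alternatives win. None may contain the '~' delimiter.
$patterns = array(
    'STRING'    => '"[^"]*"|\'[^\']*\'',
    'ATKEYWORD' => '@[a-zA-Z_-]+',
    'HASH'      => '\#[a-zA-Z0-9_-]+',
    'IDENT'     => '[a-zA-Z_-][a-zA-Z0-9_-]*',
    'NUMBER'    => '[0-9]+',
    'S'         => '\s+',
    'DELIM'     => '.',
);

function tokenize_css_sketch($css, array $patterns) {
    // Build one alternation of named groups: (?<IDENT>...)|(?<HASH>...)|...
    $parts = array();
    foreach ($patterns as $name => $p) {
        $parts[] = '(?<' . $name . '>' . $p . ')';
    }
    $regex = '~' . implode('|', $parts) . '~s';

    // One call finds every token; PREG_SET_ORDER yields one array per match.
    preg_match_all($regex, $css, $matches, PREG_SET_ORDER);

    $stream = array();
    foreach ($matches as $m) {
        foreach (array_keys($patterns) as $name) {
            // exactly one named group is non-empty for each match
            if (isset($m[$name]) && $m[$name] !== '') {
                if ($name !== 'S') { // drop whitespace tokens
                    $stream[] = array($name, $m[$name]);
                }
                break;
            }
        }
    }
    return $stream;
}
```

The catch-all DELIM group keeps preg_match_all() from silently skipping characters that no other pattern matches, so syntax errors can still be detected afterwards.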

0

user187291 Apr 9 '10 at 21:42


Do not use regexp, scan character by character.

    $tokens = array();
    $string = "...code...";
    $length = strlen($string);
    $i = 0;
    while ($i < $length) {
        $c = ord($string[$i]); // compare byte values, not a string against ord()
        if ($c >= ord('A') && $c <= ord('Z') || $c >= ord('a') && $c <= ord('z')
                || $c == ord('_') || $c == ord('-')) {
            // identifier: consume characters while they are letters, '_' or '-'
            $buf = '';
            do {
                $buf .= $string[$i];
                $i++;
                $c = ($i < $length) ? ord($string[$i]) : -1; // -1 marks end of input
            } while ($c >= ord('A') && $c <= ord('Z') || $c >= ord('a') && $c <= ord('z')
                    || $c == ord('_') || $c == ord('-'));
            $tokens[] = array('IDENT', $buf);
        } else if (......) {
            // ...... (every branch must advance $i, or the loop never ends)
        }
    }

However, this makes the code unmaintainable, so a parser generator is better.

0

Ming-tang Apr 10 '10 at 22:40


This is an old post, but I'll still contribute my 2 cents. One thing that seriously slows down the code in the question is the following line:

 $string = substr($string, strlen($matches[0]));

Instead of working on the entire string, take only a part of it (for example, 50 characters) that is long enough for all possible regular expressions, then apply the same line of code to that part. When the part shrinks below a given length, load some more data into it.
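A related way to avoid recopying the string is sketched below. Note that this is a different technique from the sliding window described above: the source string is never cut, and an integer offset is passed to preg_match() instead. The A (anchored) modifier makes the pattern match only at that offset, because a leading ^ does not work together with the offset parameter. The token patterns and the function name tokenize_with_offset() are simplified assumptions.

```php
<?php
// Hypothetical, simplified patterns; substitute the real token regexps.
$TOKENS = array(
    'ATKEYWORD' => '@[a-zA-Z_-]+',
    'IDENT'     => '[a-zA-Z_-][a-zA-Z0-9_-]*',
    'NUMBER'    => '[0-9]+',
    'DELIM'     => '\S',
);

function tokenize_with_offset($string, array $tokens) {
    $stream = array();
    $offset = 0;
    $length = strlen($string);
    while ($offset < $length) {
        // skip leading whitespace by moving the offset, not by copying
        $offset += strspn($string, " \t\r\n\f", $offset);
        if ($offset >= $length) {
            break;
        }
        foreach ($tokens as $t => $p) {
            // 'A' anchors the match at $offset; 'S' studies the pattern
            if (preg_match('&' . $p . '&AS', $string, $m, 0, $offset)) {
                $stream[] = array($t, $m[0]);
                $offset += strlen($m[0]); // advance past the token
                continue 2;
            }
        }
        throw new RuntimeException("Syntax error at byte $offset");
    }
    return $stream;
}
```

Since only an integer changes per token, the per-token cost no longer depends on how much input remains.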

0

Nir Sep 15 '13 at 2:03


erikkallen · Accepted Answer · 2010-04-09T20:08:22+0000

Use a lexer generator.
