PHP: tokenization using regex (mostly there)

I want to tokenize format strings (very roughly like printf), and I think I'm only missing a small piece:

  • % [number] [one letter ctYymd] will become a token²
  • $1 ... $10 will become a token
  • everything else (plain text) becomes a token.

I got pretty far in a regex simulator. It looks like this:

²Update: I now use # instead of %. (Fewer problems with Windows command-line options.)

[image: visualization of the regular expression from the simulator]

It isn't scary if you focus on the three parts joined by pipes (alternation), so basically there are only three alternatives. Since I want to match from start to end, I wrapped everything in /^...$/ and put the alternatives in a non-capturing group (?:...) that can be repeated one or more times:

 $exp = '/^(?:(%\\d*[ctYymd]+)|([^$%]+)|(\\$\\d))+$/'; 

My code still does not give the expected result:

 $exp = '/^(?:(%\\d*[ctYymd]+)|([^$%]+)|(\\$\\d))+$/';
 echo "expression: $exp \n";

 $tests = [
     '###%04d_Ball0n%02d$1',
     '%03d_Ball0n%02x$1%03d_Ball0n%02d$1',
     '%3d_Ball0n%02d',
 ];

 foreach ( $tests as $test ) {
     echo "teststring: $test\n";
     if( preg_match( $exp, $test, $tokens) ) {
         array_shift($tokens);
         foreach ( $tokens as $token )
             echo "\t\t'$token'\n";
     }
     else
         echo "not valid.";
 } // foreach

I get results, but the matches are wrong. The first % [number] [letter] is never reported, and other tokens show up twice:

 expression: /^((%\d*[ctYymd]+)|([^$%]+)|(\$\d))+$/
 teststring: ###%04d_Ball0n%02d$1
         '$1'
         '%02d'
         '_Ball0n'
         '$1'
 teststring: %03d_Ball0n%02x$1%03d_Ball0n%02d$1
 not valid.teststring: %3d_Ball0n%02d
         '%02d'
         '%02d'
         '_Ball0n'
 teststring: %d_foobardoo
         '_foobardoo'
         '%d'
         '_foobardoo'
 teststring: Ball0n%02dHamburg%d
         '%d'
         '%d'
         'Hamburg'
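A likely explanation for this output, not stated in the original post: when a capturing group sits inside a repeated group, PCRE keeps only the text from that group's last iteration, so a single preg_match call cannot return every token. The sketch below reuses the pattern from the question; the variable names are only illustrative.

```php
<?php
// Pattern from the question: three capturing alternatives inside a repeated group.
$exp = '/^(?:(%\d*[ctYymd]+)|([^$%]+)|(\$\d))+$/';

preg_match($exp, '###%04d_Ball0n%02d$1', $m);

// $m[0] is the whole string. $m[1], $m[2] and $m[3] each hold only the
// LAST piece that their group matched; earlier iterations are overwritten.
$lastSpecifier = $m[1]; // '%02d'    (the earlier '%04d' is gone)
$lastText      = $m[2]; // '_Ball0n' (the earlier '###' is gone)
$lastDollar    = $m[3]; // '$1'

print_r($m);
```

This is why the question's output shows at most one token per group, and why collecting all tokens needs a repeated search (for example preg_match_all) rather than one anchored match.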
1 answer

Solution (edited by OP): I use two small variants of the same pattern (they differ only in the "wrapping"): the first for validation, the second for tokenization:

 #\d*[ctYymd]+|\$\d+|[^#\$]+ 


Code:

 $core = '#\d*[ctYymd]+|\$\d+|[^#\$]+';
 $expValidate = '/^('.$core.')+$/m';
 $expTokenize = '/('.$core.')/m';

 $tests = [
     '#3d-',
     '#3d-ABC',
     '***#04d_Ball0n#02d$1',
     '#03d_Ball0n#02x$AwrongDollar',
     '#3d_Ball0n#02d',
     'Badstring#02xWrongLetterX'
 ];

 foreach ( $tests as $test ) {
     echo "teststring: [$test]\n";

     // Step 1: the whole string must consist of valid tokens only.
     if( ! preg_match_all( $expValidate, $test) ) {
         echo "not valid.\n";
         continue;
     }

     // Step 2: collect every token; $tokens[0] holds the full matches.
     if( preg_match_all( $expTokenize, $test, $tokens) ) {
         foreach ( $tokens[0] as $token )
             echo "\t\t'$token'\n";
     }
 } // foreach

Output:

 teststring: [#3d-]
         '#3d'
         '-'
 teststring: [#3d-ABC]
         '#3d'
         '-ABC'
 teststring: [***#04d_Ball0n#02d$1]
         '***'
         '#04d'
         '_Ball0n'
         '#02d'
         '$1'
 teststring: [#03d_Ball0n#02x$AwrongDollar]
 not valid.
 teststring: [#3d_Ball0n#02d]
         '#3d'
         '_Ball0n'
         '#02d'
 teststring: [Badstring#02xWrongLetterX]
 not valid.
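If the token type is needed as well as the text, one possible extension is to give each alternative a named group. This is only a sketch under assumptions: the function name tokenizeFormat and the group names spec, arg and text are hypothetical and not part of the answer, and the /m flag is left out here on the assumption that each format string is a single line.

```php
<?php
// Hypothetical helper (not from the original answer): validates a format
// string, then returns a list of [kind, text] pairs, or null if invalid.
function tokenizeFormat(string $format): ?array
{
    // Same three alternatives as the answer's $core, with names attached.
    $core = '(?<spec>#\d*[ctYymd]+)|(?<arg>\$\d+)|(?<text>[^#\$]+)';

    // Step 1: validate, as the answer does, by anchoring the repeated group.
    if (!preg_match('/^(?:' . $core . ')+$/', $format)) {
        return null;
    }

    // Step 2: tokenize; PREG_SET_ORDER yields one array per token.
    preg_match_all('/' . $core . '/', $format, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $m) {
        // Exactly one named group is non-empty for each match.
        foreach (['spec', 'arg', 'text'] as $kind) {
            if (isset($m[$kind]) && $m[$kind] !== '') {
                $tokens[] = [$kind, $m[$kind]];
                break;
            }
        }
    }
    return $tokens;
}

$result  = tokenizeFormat('***#04d_Ball0n#02d$1');
$invalid = tokenizeFormat('Badstring#02xWrongLetterX');
```

For the first test string this should give text, spec, text, spec, arg in that order; the second string fails validation and returns null.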

Source: https://habr.com/ru/post/1234472/

