PCRE: Lazy and Greedy at the same time (Possessive Quantifier)

I'm trying to match a series of text strings with PCRE in PHP, and I'm having trouble getting all the matches between the first and the last.

If anyone wonders why on Earth I would want to do this, it's because of Doc comments. Oh, how I wish Zend would provide native/plugin functions to read Doc comments from a PHP file...

The following example (plain) text will be used for the problem. It will always be pure PHP code, with only one opening tag at the beginning of the file and no closing tag. You can assume that the syntax will always be correct.

    <?php
    class someClass extends someExample
    {
        function doSomething($someArg = 'someValue')
        {
            // Nested code blocks...
            if($boolTest){}
        }
        private function killFurbies(){}
        protected function runSomething(){}
    }
    abstract class anotherClass
    {
        public function __construct(){}
        abstract function saveTheWhales();
    }
    function globalFunc(){}

Problem

I'm trying to match all methods in a class, but my regex does not find the killFurbies() method at all. Making it greedy means it only matches the last method in a class, and making it lazy means it only matches the first method.

    $part = '.*';  // Greedy
    $part = '.*?'; // Lazy

    $regex = '%class(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff]*)'
           . '.*?\{' . $part . '(?:(public|protected|private)(?:\\n|\\r|\\s)+)?'
           . 'function(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff'
           . ']*)(?:\\n|\\r|\\s)*\\(%ms';

    preg_match_all($regex, file_get_contents(__EXAMPLE__), $matches, PREG_SET_ORDER);
    var_dump($matches);

Results in:

    // Lazy:
    array(2) {
      [0]=>
      array(4) {
        [0]=>
        // Omitted.
        [1]=>
        string(9) "someClass"
        [2]=>
        string(0) ""
        [3]=>
        string(11) "doSomething"
      }
      [1]=>
      array(4) {
        [0]=>
        // Omitted.
        [1]=>
        string(12) "anotherClass"
        [2]=>
        string(6) "public"
        [3]=>
        string(11) "__construct"
      }
    }

    // Greedy:
    array(2) {
      [0]=>
      array(4) {
        [0]=>
        // Omitted.
        [1]=>
        string(9) "someClass"
        [2]=>
        string(0) ""
        [3]=>
        string(13) "saveTheWhales"
      }
      [1]=>
      array(4) {
        [0]=>
        // Omitted.
        [1]=>
        string(12) "anotherClass"
        [2]=>
        string(0) ""
        [3]=>
        string(13) "saveTheWhales"
      }
    }

How do I match all of them? :S

Any help would be greatly appreciated, as I already feel that this question is ridiculous as I type it out. Anyone who attempts to answer a question like this is braver than I am!

3 answers

It is better to use token_get_all to get the tokens of the PHP code and iterate over them. PHPDoc-style comments can be identified by the T_DOC_COMMENT token.
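As a rough illustration of this suggestion, the sketch below tokenizes a small inline sample and collects every T_DOC_COMMENT token. The $source string and its doc comments are made up for the example; in practice the input would come from file_get_contents(). It relies only on token_get_all and the T_DOC_COMMENT constant from PHP's tokenizer extension.

```php
<?php
// Hypothetical sample source, invented for this sketch.
$source = <<<'PHP'
<?php
/** Docs for someClass */
class someClass
{
    /** Docs for killFurbies */
    private function killFurbies(){}
}
PHP;

$docComments = array();
foreach (token_get_all($source) as $token) {
    // Multi-character tokens are arrays of (id, text, line);
    // single characters such as '{' or ';' are plain strings.
    if (is_array($token) && $token[0] === T_DOC_COMMENT) {
        $docComments[] = $token[1];
    }
}

print_r($docComments);
```

This only lists the comments in file order; it does not yet say which class or method each one belongs to.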


Err, can't you just parse the source using token_get_all and look for tokens like T_DOC_COMMENT? (Changed from T_COMMENT to T_DOC_COMMENT; see Gumnbo's answer.)

An example of using token_get_all can be found here.
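The linked example is not reproduced here, so the following is a minimal sketch of the idea rather than that example: walk the tokens, remember the most recent T_DOC_COMMENT, and attach it to the next function name. The helper name collect_function_docs() and the sample source are assumptions made for illustration. Results are keyed by function name only, so same-named methods in different classes would overwrite each other.

```php
<?php
// Hypothetical helper: maps each function/method name to the doc comment
// directly preceding its declaration, or false if there is none.
function collect_function_docs($source)
{
    $modifiers = array(T_PUBLIC, T_PROTECTED, T_PRIVATE, T_STATIC, T_ABSTRACT, T_FINAL);
    $docs = array();
    $pendingDoc = false; // last doc comment not yet attached to anything
    $expectName = false; // true right after a T_FUNCTION token

    foreach (token_get_all($source) as $token) {
        $id = is_array($token) ? $token[0] : null;

        if ($id === T_WHITESPACE || $id === T_COMMENT) {
            continue; // these never separate a doc comment from its declaration
        }
        if ($id === T_DOC_COMMENT) {
            $pendingDoc = $token[1];
            $expectName = false;
        } elseif ($id === T_FUNCTION) {
            $expectName = true;
        } elseif ($id === T_STRING && $expectName) {
            $docs[$token[1]] = $pendingDoc;
            $pendingDoc = false;
            $expectName = false;
        } elseif (!in_array($id, $modifiers, true)) {
            // Any other token means the doc comment belonged to something else.
            $pendingDoc = false;
            $expectName = false;
        }
    }
    return $docs;
}

// Sample input, loosely based on the class from the question.
$sample = <<<'PHP'
<?php
class someClass extends someExample
{
    /** Does something. */
    function doSomething($someArg = 'someValue') {}
    private function killFurbies(){}
    /** Runs something. */
    protected function runSomething(){}
}
PHP;

$docs = collect_function_docs($sample);
var_dump($docs);
```

Unlike the single regex in the question, this visits every method in turn, including killFurbies(), because each token is inspected once and nothing is consumed greedily or lazily.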


Solution

I came up with a class to extract Doc comments for classes and methods in a file. Thanks to everyone who answered this question, and the other one about matching code blocks.

Average benchmarks for the following example range from 0.00495 to 0.00505 seconds.

    <?php
    $file = 'path/to/libraries/tokenizer.php';
    include $file;

    $tokenizer = new Tokenizer;

    // Start Benchmarking here.
    $tokenizer->load($file);
    // End Benchmarking here.

    // The following will output 'bool(false)'.
    var_dump($tokenizer->get_doc('Tokenizer', 'get_tokens'));

    // The following will output 'string(18) "/** load method */"'.
    var_dump($tokenizer->get_doc('Tokenizer', 'load'));

The Tokenizer class (yes, I still haven't thought of a better name for it...):

    <?php
    class Tokenizer
    {
        private $compiled = false, $path = false, $tokens = false, $classes = array();

        /** load method */
        public function load($path)
        {
            $path = realpath($path);
            if(!file_exists($path) || !function_exists('token_get_all')) {
                return false;
            }
            $this->compiled = false;
            $this->classes = array();
            $this->path = $path;
            $this->tokens = false;
            $this->get_tokens();
            $this->get_classes();
            $this->class_blocks();
            $this->class_functions();
            return true;
        }

        protected function get_tokens()
        {
            $tokens = token_get_all(file_get_contents($this->path));
            $compiled = '';
            foreach($tokens as $k => $t) {
                if(is_array($t) && $t[0] != T_WHITESPACE) {
                    $compiled .= $k . ':' . $t[0] . ',';
                } else {
                    if($t == '{' || $t == '}') {
                        $compiled .= $t . ',';
                    }
                }
            }
            $this->tokens = $tokens;
            $this->compiled = trim($compiled, ',');
        }

        protected function get_classes()
        {
            if(!$this->compiled) {
                return false;
            }
            $regex = '%(?:(\\d+)\\:366,)?(?:\\d+\\:(?:345|344|353),)?\\d+\\:352,(\\d+)\\:307,(?:\\d+\\:(?:354|355),\\d+\\:307,)*{%';
            preg_match_all($regex, $this->compiled, $classes, PREG_SET_ORDER);
            if(is_array($classes)) {
                foreach($classes as $class) {
                    $this->classes[$this->tokens[$class[2]][1]] = array('token' => $class[2]);
                    $this->classes[$this->tokens[$class[2]][1]]['doc'] = isset($this->tokens[$class[1]][1]) ? $this->tokens[$class[1]][1] : false;
                }
            }
        }

        private function class_blocks()
        {
            if(!$this->compiled) {
                return false;
            }
            foreach($this->classes as $class_name => $class) {
                $this->classes[$class_name]['block'] = $this->get_block($class['token']);
            }
        }

        protected function get_block($name_token)
        {
            if(!$this->compiled || ($pos = strpos($this->compiled, $name_token . ':')) === false) {
                return false;
            }
            $section = substr($this->compiled, $pos);
            $len = strlen($section);
            $block = '';
            $opening = 1;
            $closing = 0;
            for($i = 0; $i < $len; $i++) {
                if($section[$i] == '{') {
                    $opening++;
                } elseif($section[$i] == '}') {
                    $closing++;
                    if($closing == $opening) {
                        break;
                    }
                }
                if($opening > 0) {
                    $block .= $section[$i];
                }
            }
            return trim($block, ',');
        }

        protected function class_functions()
        {
            if(!$this->compiled) {
                return false;
            }
            foreach($this->classes as $class_name => $class) {
                $regex = '%(?:(\d+)\:366,)?(?:\d+\:(?:344|345),)?(?:\d+\:(?:341|342|343),)?\d+\:333,(\d+)\:307,\{%';
                preg_match_all($regex, $class['block'], $functions, PREG_SET_ORDER);
                foreach($functions as $function) {
                    $function_name = $this->tokens[$function[2]][1];
                    $this->classes[$class_name]['functions'][$function_name] = array('token' => $function[2]);
                    $this->classes[$class_name]['functions'][$function_name]['doc'] = isset($this->tokens[$function[1]][1]) ? $this->tokens[$function[1]][1] : false;
                    $this->classes[$class_name]['functions'][$function_name]['block'] = $this->get_block($function[2]);
                }
            }
        }

        public function get_doc($class, $function = false)
        {
            if(!is_string($class) || !isset($this->classes[$class])) {
                return false;
            }
            if(!is_string($function)) {
                return $this->classes[$class]['doc'];
            } else {
                if(!isset($this->classes[$class]['functions'][$function])) {
                    return false;
                }
                return $this->classes[$class]['functions'][$function]['doc'];
            }
        }
    }
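One caveat with the class above: the numeric IDs hard-coded in its regexes (366, 352, 307 and so on) are the integer values of tokenizer constants, and those values differ between PHP versions. A hedged sketch of one way around this is to compile token names via token_name() instead of raw IDs. It assumes that 366, 352 and 307 stood for T_DOC_COMMENT, T_CLASS and T_STRING in the PHP build the class was written against; the sample source and variable names are made up for the example.

```php
<?php
// Hypothetical one-line sample, invented for this sketch.
$source = '<?php /** doc */ class Foo { function bar() {} }';
$tokens = token_get_all($source);

// Same "index:token" encoding as get_tokens(), but with symbolic names.
$parts = array();
foreach ($tokens as $index => $token) {
    if (is_array($token)) {
        if ($token[0] !== T_WHITESPACE) {
            $parts[] = $index . ':' . token_name($token[0]);
        }
    } elseif ($token === '{' || $token === '}') {
        $parts[] = $token;
    }
}
$compiled = implode(',', $parts);

// Simplified class pattern keyed on names instead of version-specific numbers.
$regex = '%(?:(\d+):T_DOC_COMMENT,)?(?:\d+:(?:T_ABSTRACT|T_FINAL),)?\d+:T_CLASS,(\d+):T_STRING,%';
preg_match($regex, $compiled, $match);

$className = $tokens[$match[2]][1];
$classDoc  = $match[1] !== '' ? $tokens[$match[1]][1] : false;

var_dump($className, $classDoc);
```

The compiled string is longer this way, so the timings quoted earlier would not carry over unchanged.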

Any thoughts or comments on this? Any criticism is welcome!

Thanks, mniz.


Source: https://habr.com/ru/post/1302614/

