I read the famous post. I have seen attempts, both with limited success and with failure. Oh, fiery wars, both here and elsewhere.
But it can be done.
Although I know that the actual argument (the fact of reading) is that regular expressions are not suitable for parsing structured data trees because of their inability to control and change state, I feel that some blindly discard the possibility. Application logic is required to maintain state, but as this working example shows, this can be done.
Below is the corresponding snippet:
const PARSE_MODE_NEXT = 0;
const PARSE_MODE_ELEMENT = 1;
const PARSE_MODE_ENTITY = 3;
const PARSE_MODE_COMMENT = 4;
const PARSE_MODE_CDATA = 5;
const PARSE_MODE_PROC = 6;
protected $_parseModes = array(
self::PARSE_MODE_NEXT => '% < (?: (?: (?<entity>!) (?: (?<comment>--) | (?<cdata>\[CDATA\[) ) ) | (?<proc>\?) )? %six',
self::PARSE_MODE_ELEMENT => '% (?<close>/)? (?<element> .*? ) (?<empty> / )? > (?<text> [^<]* ) %six',
self::PARSE_MODE_ENTITY => '% (?<entity> .*? ) > (?<text> [^<]* ) %six',
self::PARSE_MODE_COMMENT => '% (?<comment> .*? ) --> (?<text> [^<]* ) %six',
self::PARSE_MODE_CDATA => '% (?<cdata> .*? ) \]\]> (?<text> [^<]* ) %six',
self::PARSE_MODE_PROC => '% (?<proc> .*? ) \?> (?<text> [^<]* ) %six',
);
public function load($string){
$parseMode = self::PARSE_MODE_NEXT;
$parseOffset = 0;
$context = $this;
while(preg_match($this->_parseModes[$parseMode], $string, $match, PREG_OFFSET_CAPTURE, $parseOffset)){
if($parseMode == self::PARSE_MODE_NEXT){
switch(true){
case (!($match['entity'][0] || $match['comment'][0] || $match['cdata'][0] || $match['proc'][0])):
$parseMode = self::PARSE_MODE_ELEMENT;
break;
case ($match['proc'][0]):
$parseMode = self::PARSE_MODE_PROC;
break;
case ($match['cdata'][0]):
$parseMode = self::PARSE_MODE_CDATA;
break;
case ($match['comment'][0]):
$parseMode = self::PARSE_MODE_COMMENT;
break;
case ($match['entity'][0]):
$parseMode = self::PARSE_MODE_ENTITY;
break;
}
}else{
switch($parseMode){
case (self::PARSE_MODE_ELEMENT):
switch(true){
case (!($match['close'][0] || $match['empty'][0])):
$context = $context->addChild(new ZuqMLElement($match['element'][0]));
break;
case ($match['empty'][0]):
$context->addChild(new ZuqMLElement($match['element'][0]));
break;
case ($match['close'][0]):
$context = $context->_parent;
break;
}
break;
case (self::PARSE_MODE_ENTITY):
$context->addChild(new ZuqMLEntity($match['entity'][0]));
break;
case (self::PARSE_MODE_COMMENT):
$context->addChild(new ZuqMLComment($match['comment'][0]));
break;
case (self::PARSE_MODE_CDATA):
$context->addChild(new ZuqMLCharacterData($match['cdata'][0]));
break;
case (self::PARSE_MODE_PROC):
$context->addChild(new ZuqMLProcessingInstruction($match['proc'][0]));
break;
}
$parseMode = self::PARSE_MODE_NEXT;
}
if(trim($match['text'][0])){
$context->addChild(new ZuqMLText($match['text'][0]));
}
$parseOffset = $match[0][1] + strlen($match[0][0]);
}
}
Are you done? No.
Is it really impossible? Of course not.
Is it fast? Not tested, but I canβt imagine it as fast as DOM.
XPath/XQuery? , .
- ? .
DOM? .
, ?
<?xml version="1.0" encoding="utf-8"?>
<!ENTITY name="value">
<root>
<node>
<node />
Foo
<node name="value">
<node>Bar</node>
</node>
</node>
<node>
<[CDATA[ Character Data ]]>
</node>
</root>
. , .
, Community Wiki, , , .
, - , , ? , .
" ", .
, , SimpleXML , DOM .