Parsing XML / XHTML with Regex

Question

I have read the famous post. I have seen the attempts, some with limited success and some failing outright. And oh, the flame wars, both here and elsewhere.

But it can be done.

Although I know that the actual argument (read: fact) is that regular expressions are unsuited to parsing structured data trees because they cannot track and change state, I feel that some people dismiss the possibility blindly. Application logic is required to maintain state, but as this working example shows, it can be done.

Below is the corresponding snippet:

const PARSE_MODE_NEXT = 0;
const PARSE_MODE_ELEMENT = 1;
const PARSE_MODE_ENTITY = 3;
const PARSE_MODE_COMMENT = 4;
const PARSE_MODE_CDATA = 5;
const PARSE_MODE_PROC = 6;

protected $_parseModes = array(
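        // The inner (comment | cdata) group is optional so that a bare "<!" (e.g. <!ENTITY ...>) still sets "entity".
        self::PARSE_MODE_NEXT     => '% < (?: (?: (?<entity>!) (?: (?<comment>--) | (?<cdata>\[CDATA\[) )? ) | (?<proc>\?) )? %six',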
        self::PARSE_MODE_ELEMENT  => '% (?<close>/)? (?<element> .*? ) (?<empty> / )? > (?<text> [^<]* ) %six',
        self::PARSE_MODE_ENTITY   => '% (?<entity> .*? ) > (?<text> [^<]* ) %six',
        self::PARSE_MODE_COMMENT  => '% (?<comment> .*? ) --> (?<text> [^<]* ) %six',
        self::PARSE_MODE_CDATA    => '% (?<cdata> .*? ) \]\]> (?<text> [^<]* ) %six',
        self::PARSE_MODE_PROC     => '% (?<proc> .*? ) \?> (?<text> [^<]* ) %six',
    );

public function load($string){
    $parseMode = self::PARSE_MODE_NEXT;
    $parseOffset = 0;
    $context = $this;
    while(preg_match($this->_parseModes[$parseMode], $string, $match, PREG_OFFSET_CAPTURE, $parseOffset)){
        if($parseMode == self::PARSE_MODE_NEXT){
            // Optional groups that did not take part in the match may be missing from $match,
            // so test them with empty() rather than reading the index directly.
            switch(true){
                case (empty($match['entity'][0]) && empty($match['comment'][0]) && empty($match['cdata'][0]) && empty($match['proc'][0])):
                    $parseMode = self::PARSE_MODE_ELEMENT;
                    break;
                case (!empty($match['proc'][0])):
                    $parseMode = self::PARSE_MODE_PROC;
                    break;
                case (!empty($match['cdata'][0])):
                    $parseMode = self::PARSE_MODE_CDATA;
                    break;
                case (!empty($match['comment'][0])):
                    $parseMode = self::PARSE_MODE_COMMENT;
                    break;
                case (!empty($match['entity'][0])):
                    $parseMode = self::PARSE_MODE_ENTITY;
                    break;
            }
        }else{
            switch($parseMode){
                case (self::PARSE_MODE_ELEMENT):
                    switch(true){
                        case (!($match['close'][0] || $match['empty'][0])):
                            $context = $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                        case ($match['empty'][0]):
                            $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                        case ($match['close'][0]):
                            $context = $context->_parent;
                            break;
                    }
                    break;
                case (self::PARSE_MODE_ENTITY):
                    $context->addChild(new ZuqMLEntity($match['entity'][0]));
                    break;
                case (self::PARSE_MODE_COMMENT):
                    $context->addChild(new ZuqMLComment($match['comment'][0]));
                    break;
                case (self::PARSE_MODE_CDATA):
                    $context->addChild(new ZuqMLCharacterData($match['cdata'][0]));
                    break;
                case (self::PARSE_MODE_PROC):
                    $context->addChild(new ZuqMLProcessingInstruction($match['proc'][0]));
                    break;
            }
            $parseMode = self::PARSE_MODE_NEXT;
        }
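        // The PARSE_MODE_NEXT pattern has no "text" group, so guard the lookup.
        // Comparing against '' also keeps text such as "0", which trim() alone would treat as false.
        if(isset($match['text']) && trim($match['text'][0]) !== ''){
            $context->addChild(new ZuqMLText($match['text'][0]));
        }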
        $parseOffset = $match[0][1] + strlen($match[0][0]);
    }

}
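
To make the "regex finds the tokens, application logic keeps the state" idea concrete, here is a stripped-down sketch that is separate from the class above. It assumes the input contains only elements and text (no comments, CDATA, processing instructions, or entities), and `SketchNode`, `sketchParse`, and `sketchDump` are hypothetical names invented for this illustration, not part of the original code.

```php
<?php
// Hypothetical tree node for this sketch; not one of the ZuqML classes above.
final class SketchNode
{
    /** @var string */
    public $name;
    /** @var SketchNode|null */
    public $parent;
    /** @var array child nodes and text chunks, in document order */
    public $children = [];

    public function __construct(string $name, ?SketchNode $parent = null)
    {
        $this->name   = $name;
        $this->parent = $parent;
    }
}

// Assumption: $xml contains only elements and text.
function sketchParse(string $xml): SketchNode
{
    $root    = new SketchNode('#root');
    $context = $root; // the state that a regular expression alone cannot carry

    // Each match is one token: a tag (opening, closing, or self-closing) or a run of text.
    $pattern = '%<(?<close>/)?(?<name>[A-Za-z_][\w.-]*)[^>]*?(?<empty>/)?>|(?<text>[^<]+)%';
    preg_match_all($pattern, $xml, $tokens, PREG_SET_ORDER);

    foreach ($tokens as $token) {
        $text = $token['text'] ?? '';
        if ($text !== '') {
            // Text token: keep it unless it is whitespace only.
            if (trim($text) !== '') {
                $context->children[] = trim($text);
            }
            continue;
        }
        if ($token['close'] !== '') {
            // Closing tag: step back up, but never above the synthetic root.
            $context = $context->parent ?? $context;
            continue;
        }
        $node = new SketchNode($token['name'], $context);
        $context->children[] = $node;
        if (($token['empty'] ?? '') === '') {
            $context = $node; // opening tag: descend into the new element
        }
    }
    return $root;
}

// Renders the tree as name(child,child,...) so the nesting is easy to check by eye.
function sketchDump(SketchNode $node): string
{
    $parts = [];
    foreach ($node->children as $child) {
        $parts[] = $child instanceof SketchNode ? sketchDump($child) : $child;
    }
    return $node->name . ($parts ? '(' . implode(',', $parts) . ')' : '');
}

echo sketchDump(sketchParse('<root><node><node />Foo</node></root>')), "\n";
```

The regular expression never sees more than one tag at a time. All of the nesting lives in `$context`, which plays the same role as `$context` in `load()` above.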

Is it finished? No.

Is it really impossible? Of course not.

Is it fast? I have not benchmarked it, but I can't imagine it being as fast as DOM.

XPath/XQuery? , .

DOM? .

<?xml version="1.0" encoding="utf-8"?>
<!ENTITY name="value">
<root>
    <node>
        <node />
        Foo
        <node name="value">
            <node>Bar</node>
        </node>
        <!-- Comment -->
    </node>
    <node>
        <![CDATA[ Character Data ]]>
    </node>
</root>

, Community Wiki, , , .

, , SimpleXML , DOM .

dom xml php regex

Dan Lugg, Jan 25 '11 at 5:30

Michael Kay · Accepted Answer · 2011-01-25T09:54:45+0000

, - , , ? XML, XML XML ?

, , XML, , XML-, -XML- , . , , , , , , , , , HTML, , .

PHP, , XML. - XML- ,
