Why does XPath remove HTML special characters?
Why does this

    $html = '<a href="/browse/product.do?cid=1&amp;vid=1&amp;pid=1" class="productItemName">what is going on here</a>';

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $selectors['link'] = '//a/@href';
    $links_nodeList = $xpath->query($selectors['link']);

    // Collect the value of every matched href attribute.
    $links = [];
    foreach ($links_nodeList as $link) {
        $links[] = $link->nodeValue;
    }

    echo("<p>links</p>");
    echo("<pre>");
    print_r($links);
    echo("</pre>");

output

    links
    Array ( [0] => /browse/product.do?cid=1&vid=1&pid=1 )

but not

    links
    Array ( [0] => /browse/product.do?cid=1&amp;vid=1&amp;pid=1 )

?
The answer is simple:
&amp; is a special way to represent the "&" character in an XML (or HTML) document.
Both forms designate the same character.
When the escaped form of an ampersand is displayed as text (and not as markup), it appears as "&".
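A minimal sketch of this behaviour, using the markup from the question with the ampersands written in their escaped form &amp;. The variable names `$href` and `$escaped` are only illustrative, and using `htmlspecialchars()` to get the escaped form back is one possible approach, not something the original answer prescribes:

```php
<?php
// Markup from the question; the href contains the escaped form "&amp;".
$html = '<a href="/browse/product.do?cid=1&amp;vid=1&amp;pid=1" class="productItemName">what is going on here</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// nodeValue holds the parsed attribute value, so the entity is already decoded to "&".
$href = $xpath->query('//a/@href')->item(0)->nodeValue;

// Re-escape only at the point where the value is written back into HTML.
$escaped = htmlspecialchars($href);

echo $href, "\n";
echo $escaped, "\n";
```

Here `$href` contains plain "&" characters, while `$escaped` contains "&amp;" again. The value in the DOM is the same string either way; only its serialized form differs.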
As @LarsH described in detail in his comment:
when you say loadhtml($html) you parse the string as HTML, which means that character entities (like &amp;) are interpreted into the characters they represent (like &). If you need a string that will be interpreted as &amp;, you need to escape the ampersand, like &amp;amp;
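The double escaping described in the comment can be sketched as follows. The shortened URL and the variable name `$value` are assumptions made for illustration; the point is only that the parser decodes one level of escaping:

```php
<?php
// Assumption for illustration: the literal text "&amp;" is wanted in the parsed value,
// so the ampersand of "&amp;" is itself escaped, giving "&amp;amp;".
$html = '<a href="/browse/product.do?cid=1&amp;amp;vid=1">link</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The HTML parser decodes exactly one level: "&amp;amp;" becomes "&amp;".
$value = $xpath->query('//a/@href')->item(0)->nodeValue;

echo $value, "\n";
```

This is rarely what a URL needs, since the real query string uses a plain "&". It is mainly useful when the entity text itself has to survive parsing.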