DOMDocument appendXML with special characters

I am fetching some HTML lines from my database, and I would like to parse them into my DOMDocument. The problem is that DOMDocument emits warnings for special characters.

Warning: DOMDocumentFragment::appendXML() [domdocumentfragment.appendxml]: Entity: line 2: parser error : Entity 'nbsp' not defined in page.php on line 189

I wonder why this happens and how to solve it. These are some code snippets from my page. How can I fix these warnings?

    $doc = new DOMDocument();
    // .. create some elements first, like some divs and an h1 ..
    while ($row = mysql_fetch_array($result)) {
        $messageEl = $doc->createDocumentFragment();
        $messageEl->appendXML($row['message']); // gives the warnings here!
        $otherElement->appendChild($messageEl);
    }
    echo $doc->saveHTML();

I also found something about validation, but when I apply it, my page no longer loads. The code I tried was something like this:

    $implementation = new DOMImplementation();
    $dtd = $implementation->createDocumentType(
        'html',
        '-//W3C//DTD XHTML 1.0 Transitional//EN',
        'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
    );
    $doc = $implementation->createDocument('', '', $dtd);
    $doc->validateOnParse = true;
    $doc->formatOutput = true;

    // in the same while loop, I used the following:
    $messageEl = $doc->createDocumentFragment();
    $doc->validate(); // which stopped my code, without any error or warning.
    $messageEl->appendXml($row['message']);

Thanks in advance!

5 answers

There is no &nbsp; in XML. The only character entities that have an actual name defined (instead of using a numeric reference) are &amp;, &lt;, &gt;, &quot; and &apos;.

This means that you need to use the numeric equivalent of a non-breaking space, which is &#160; or (in hexadecimal) &#xA0;.
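As a minimal sketch of that substitution, the named entity can be swapped for its numeric form before the string reaches appendXML(). The helper name nbspToNumeric and the sample markup are made up for illustration, and only &nbsp; is handled here; other HTML-only entities would need the same treatment.

```php
<?php
// Hypothetical helper: replace the HTML-only &nbsp; entity with the numeric
// reference &#160;, which an XML parser accepts without any DTD.
function nbspToNumeric(string $markup): string
{
    return str_replace('&nbsp;', '&#160;', $markup);
}

$doc = new DOMDocument('1.0', 'UTF-8');
$container = $doc->appendChild($doc->createElement('div'));

$fragment = $doc->createDocumentFragment();
// appendXML() returns true when the string parsed as well-formed XML.
$ok = $fragment->appendXML(nbspToNumeric('<p>Hello&nbsp;world</p>'));
$container->appendChild($fragment);

echo $doc->saveXML($container), "\n";
```

The rest of the markup still has to be well-formed XML (closed tags, quoted attributes), so this only helps when the entity is the sole problem.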

If you are trying to store HTML in an XML container, store it as text. HTML and XML may look similar, but they are very different. appendXML() expects well-formed XML as its argument. Use the nodeValue property instead; it will encode the HTML string without any warnings.

    // document fragment is completely unnecessary
    $otherElement->nodeValue = $row['message'];
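A variant of the same idea, sketched with createTextNode() instead of a nodeValue assignment. This is a swapped-in technique, not the answer's own code: assigning to nodeValue has been reported to warn about bare ampersands in some PHP versions, whereas a text node is never parsed at all. The $message string stands in for $row['message'].

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$otherElement = $doc->appendChild($doc->createElement('div'));

// Stand-in for $row['message'] from the question.
$message = '<b>Tom &amp; Jerry&nbsp;</b>';

// A text node is stored verbatim and never parsed as markup;
// "<" and "&" are escaped only when the document is serialized.
$otherElement->appendChild($doc->createTextNode($message));

$xml = $doc->saveXML($otherElement);
echo $xml, "\n";
```

Note the consequence: the markup is shown literally in the output rather than rendered as bold text, which is what "store it as text" implies.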
+6
source

This is tricky because there are actually several problems in one.

As Tomalak points out, XML doesn't have &nbsp;. So you did the right thing by specifying a DOMImplementation, because XHTML does have &nbsp;. But for DOM to know that the document is XHTML, you have to load the DTD and validate against it. The DTD is located at

 http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd 

but because millions of requests hit this page every day, the W3C decided to block access when no User-Agent is sent with the request. To supply a User-Agent, you have to create a custom stream context.

In code:

    // make sure DOM passes a User Agent when it fetches the DTD
    libxml_set_streams_context(
        stream_context_create(
            array(
                'http' => array(
                    'user_agent' => 'PHP libxml agent',
                )
            )
        )
    );

    // specify the implementation
    $imp = new DOMImplementation;

    // create a DTD (here: for XHTML)
    $dtd = $imp->createDocumentType(
        'html',
        '-//W3C//DTD XHTML 1.0 Transitional//EN',
        'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
    );

    // then create a DOMDocument with the configured DTD
    $dom = $imp->createDocument(NULL, "html", $dtd);
    $dom->encoding = 'UTF-8';
    $dom->validate();

    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML('
        <head><title>XHTML test</title></head>
        <body><p>Some text with a &nbsp; entity</p></body>
    ');
    $dom->documentElement->appendChild($fragment);
    $dom->formatOutput = TRUE;
    echo $dom->saveXml();

It will take some time (don't ask me why), but in the end you will get (reformatted for SO)

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>XHTML test</title>
      </head>
      <body>
        <p>Some text with a &nbsp; entity</p>
      </body>
    </html>

See also: DOMDocument::validate() problem

+5
source

I understand the problem, and I see that the question has already been answered, but let me offer an idea from my own experience with similar problems.

Your task may simply be to include marked-up data from the database in the resulting XML, and that may or may not require parsing it. If this is just data to include, rather than structured parts of your XML, you can put the strings from the database into CDATA sections, effectively bypassing all validation errors at this point.
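A minimal sketch of the CDATA idea, assuming the database string only needs to be carried along and not parsed. The element name message and the sample string are placeholders for illustration.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$message = $doc->appendChild($doc->createElement('message'));

// Stand-in for a row fetched from the database.
$html = '<p>Some text with a &nbsp; entity</p>';

// The content of a CDATA section is not parsed, so &nbsp; raises no warning.
// Caveat: the payload must not itself contain the sequence "]]>".
$message->appendChild($doc->createCDATASection($html));

$xml = $doc->saveXML($message);
echo $xml, "\n";
// prints: <message><![CDATA[<p>Some text with a &nbsp; entity</p>]]></message>
```

Whoever consumes the XML then receives the HTML as an opaque string and has to parse it separately if the structure is needed.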

0
source

Here's a different approach, because we did not want possibly slow network requests (or any network requests in general as a result of user input):

    <?php
    $document = new \DOMDocument();
    $document->loadHTML('<html><body></body></html>');

    $html = '<b>test&nbsp;</b>';
    $fragment = $document->createDocumentFragment();

    $html = '<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE document [
    <!ENTITY nbsp "&#160;" >
    ]>
    <document>'.$html.'</document>';

    $newdom = new \DOMDocument();
    $newdom->loadXML($html, LIBXML_HTML_NOIMPLIED | LIBXML_NOCDATA | LIBXML_NOENT | LIBXML_NONET | LIBXML_NOBLANKS);

    foreach ($newdom->documentElement->childNodes as $childnode) {
        $fragment->appendChild($fragment->ownerDocument->importNode($childnode, TRUE));
    }

    $document->getElementsByTagName('body')[0]->appendChild($fragment);
    echo $document->saveHTML();

Here we include the relevant part of the DTD, specifically the latin1 entity definitions, as an internal DOCTYPE subset. The HTML content is then wrapped in a document element so that a sequence of child elements can be processed. Finally, the parsed nodes are imported and appended to the target DOM.

Our actual implementation uses file_get_contents to load a DTD containing all entity definitions from a local file.
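A rough sketch of what such a file-based variant could look like. The function name parseHtmlSnippet, the temp file, and its two entity declarations are assumptions made so the example is self-contained; a real setup would point at a maintained local file with the full entity set.

```php
<?php
// Hypothetical local file holding the entity declarations. It is written to
// a temp path here only so that the sketch runs on its own.
$entityFile = tempnam(sys_get_temp_dir(), 'ent');
file_put_contents($entityFile, '<!ENTITY nbsp "&#160;"><!ENTITY copy "&#169;">');

function parseHtmlSnippet(string $html, string $entityFile): DOMDocument
{
    // Inline the declarations as an internal DOCTYPE subset.
    $entities = file_get_contents($entityFile);
    $xml = '<?xml version="1.0" encoding="UTF-8"?>'
         . '<!DOCTYPE document [' . $entities . ']>'
         . '<document>' . $html . '</document>';

    $dom = new DOMDocument();
    // LIBXML_NOENT substitutes the entities; LIBXML_NONET forbids network access.
    $dom->loadXML($xml, LIBXML_NOENT | LIBXML_NONET);
    return $dom;
}

$parsed = parseHtmlSnippet('<b>&copy;&nbsp;2024</b>', $entityFile);
$text = $parsed->documentElement->textContent;
unlink($entityFile);
```

The child nodes of the returned document element can then be imported into the target document with importNode(), as in the snippet above.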

0
source

Although Smarty may be a good bet (why reinvent the wheel for the 14th time?), etranger may have a point. There are situations in which you don't want to pull in something as heavy as a complete new (and unfamiliar) package, but simply want to output some data from a database that happens to contain HTML the XML parser chokes on.

A warning: the following is a simple solution, but don't do it unless you are SURE you can get away with it! (I did this when I had about 2 hours before a deadline and didn't have time to study, let alone adopt, something like Smarty...)

Before passing a string to the appendXML function, run it through preg_replace. For example, replace all &nbsp; entities with [some_prefix]_nbsp. Then, on the page where you output the HTML, do the reverse.

And Presto! =)

Code example: Code that places text in a document fragment:

    // add text tag to p tag.
    // print("CCMSSelTextBody::getDOMObject: strText: ".$this->m_strText."<br>\n");
    $this->m_strText = preg_replace("/&nbsp;/", "__nbsp__", $this->m_strText);
    $domTextFragment = $domDoc->createDocumentFragment();
    $domTextFragment->appendXML(utf8_encode($this->m_strText));
    $p->appendChild($domTextFragment);
    // $p->appendChild(new DOMText(utf8_encode($this->m_strText)));

Code that parsed the string and wrote html:

    // Instantiate template.
    $pTemplate = new CTemplate($env, $pageID, $pUser, $strState);

    // Parse tag-sets.
    $pTemplate->parseTXTTags();
    $pTemplate->parseCMSTags();

    // present the html code.
    $html = $pTemplate->getPageHTML();
    $html = preg_replace("/__nbsp__/", "&nbsp;", $html);
    print($html);

It's probably a good idea to come up with a more robust placeholder. (If you insist on being careful: take the md5 of time() and hardcode the result as a prefix.) So in the first snippet:

 $this->m_strText = preg_replace("/&nbsp;/", "4597ee308cd90d78aa4655e76bf46ee0_nbsp", $this->m_strText); 

And in the second:

 $html = preg_replace("/4597ee308cd90d78aa4655e76bf46ee0_nbsp/", "&nbsp;", $html); 

Do the same for any other tags and things you need to get around.
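For several entities at once, the round trip could be sketched roughly as below. This uses strtr() with a lookup table instead of repeated preg_replace() calls, which is a substitution of mine rather than the code I actually shipped; the constant and function names are hypothetical, and the prefix is assumed never to occur in real content.

```php
<?php
// Hypothetical placeholder map; the prefix is an arbitrary hard-coded token.
const ENTITY_PLACEHOLDERS = [
    '&nbsp;' => '4597ee308cd90d78aa4655e76bf46ee0_nbsp',
    '&copy;' => '4597ee308cd90d78aa4655e76bf46ee0_copy',
];

// Hide the entities from the XML parser.
function maskEntities(string $text): string
{
    return strtr($text, ENTITY_PLACEHOLDERS);
}

// Restore the entities in the final HTML output.
function unmaskEntities(string $html): string
{
    return strtr($html, array_flip(ENTITY_PLACEHOLDERS));
}

$doc = new DOMDocument('1.0', 'UTF-8');
$p = $doc->appendChild($doc->createElement('p'));

$fragment = $doc->createDocumentFragment();
$fragment->appendXML(maskEntities('<b>&copy;&nbsp;ACME</b>'));
$p->appendChild($fragment);

$html = unmaskEntities($doc->saveXML($p));
echo $html, "\n";
// prints: <p><b>&copy;&nbsp;ACME</b></p>
```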

This is a hack, not good code by any stretch of the imagination. But it saved my neck, and I wanted to share it with others who face this problem with minimal time.

Use the above at your own risk.

-1
source

Source: https://habr.com/ru/post/1433646/

