How to check CDATA section for XML in PHP

I am creating XML based on user input. One of the XML nodes has a CDATA section. If one of the characters inserted into the CDATA section is "special" (a control character, as far as I can tell), the whole XML document becomes invalid.

Example:

    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->appendChild($dom->createElement('root'))
        ->appendChild($dom->createCDATASection(
            "This is some text with a SOH char \x01."
        ));

    $test = new DOMDocument;
    $test->loadXml($dom->saveXML());
    echo $test->saveXml();

will give

    Warning: DOMDocument::loadXML(): CData section not finished This is some text with a SOH cha in Entity, line: 2 in /newfile.php on line 17
    Warning: DOMDocument::loadXML(): PCDATA invalid Char value 1 in Entity, line: 2 in /newfile.php on line 17
    Warning: DOMDocument::loadXML(): Sequence ']]>' not allowed in content in Entity, line: 2 in /newfile.php on line 17
    Warning: DOMDocument::loadXML(): Sequence ']]>' not allowed in content in Entity, line: 2 in /newfile.php on line 17
    Warning: DOMDocument::loadXML(): internal errorExtra content at the end of the document in Entity, line: 2 in /newfile.php on line 17
    <?xml version="1.0"?>

Is there a good way in PHP to make sure the content of a CDATA section is valid?

+4
4 answers

The allowed character range for a CDATA section is:

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

So you must sanitize your string so that it contains only those characters.
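A minimal sketch of checking a string against this range before putting it into a CDATA section. The function name `is_valid_xml_text` is hypothetical, and the input is assumed to be UTF-8 encoded:

```php
/**
 * Hypothetical helper: returns true if $text is valid UTF-8 and every
 * code point falls inside the character range listed above.
 */
function is_valid_xml_text(string $text): bool
{
    // \x{...} together with the /u modifier matches whole Unicode code points.
    $invalid = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // preg_match() returns 0 when nothing invalid is found, 1 when something
    // is found, and false when $text is not valid UTF-8.
    return preg_match($invalid, $text) === 0;
}

var_dump(is_valid_xml_text("plain text"));     // bool(true)
var_dump(is_valid_xml_text("SOH char \x01.")); // bool(false)
```

Rejecting the input, rather than silently stripping characters, may be preferable when the user should be told that their text was changed.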

+8

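This happens because "\x01" is a non-printable control character, which XML 1.0 does not allow, so it triggers the warnings. You can work around it by URL-encoding the text before inserting it and decoding it after reading: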

    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->appendChild($dom->createElement('root'))
        ->appendChild($dom->createCDATASection(
            urlencode("This is some text with a SOH char \x01.")
        ));

    $test = new DOMDocument;
    $test->loadXml($dom->saveXML());
    echo urldecode($test->saveXml());

+2

Using Gordon's answer, I did:

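    /**
     * Removes characters that are not allowed in XML 1.0 from a string.
     *
     * @param string $content UTF-8 encoded input
     *
     * @return string
     */
    function sanitize_html($content)
    {
        if ((string) $content === '') {
            return '';
        }

        // The /u modifier and \x{...} escapes are required to match code
        // points above \xFF; \x{D} (carriage return) is allowed as well.
        $invalid_characters = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

        return preg_replace($invalid_characters, '', $content);
    }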

Use as:
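A minimal sketch of how such a function could be plugged into the code from the question. So that it runs by itself, it defines a hypothetical stand-in named `strip_invalid_xml_chars`; the name and the empty-string fallback are assumptions, not part of the answer:

```php
// Hypothetical stand-in for the sanitizing function (assumes UTF-8 input).
function strip_invalid_xml_chars(string $text): string
{
    $invalid = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // preg_replace() returns null if $text is not valid UTF-8; fall back to ''.
    return preg_replace($invalid, '', $text) ?? '';
}

$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'))
    ->appendChild($dom->createCDATASection(
        strip_invalid_xml_chars("This is some text with a SOH char \x01.")
    ));

// The round trip should now load without warnings.
$test = new DOMDocument;
$loaded = $test->loadXML($dom->saveXML());
echo $test->saveXML();
```

Sanitizing at the point where the CDATA section is created keeps the rest of the document-building code unchanged.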

+1

Take a look at simplexml_load_file ( http://php.net/manual/en/function.simplexml-load-file.php ) and the LIBXML_NOCDATA option ( http://www.php.net/manual/en/libxml.constants.php ). This will most likely answer your question.

-1

Source: https://habr.com/ru/post/1384919/

