Creating an XML document in PHP (escape characters)

I am creating an XML document from a PHP script, and I need to escape the special XML characters. I know the list of characters that should be escaped, but what is the correct way to do it?

Should the characters be escaped with just a backslash (\), or is there a proper way? Is there a built-in PHP function that can handle this for me?

+43
xml php
Oct 18 2018-10-10T00:
10 answers

Use the DOM classes to generate the whole XML document. They handle encoding and decoding details that we don’t even want to care about.
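
As a rough illustration of that advice, here is a minimal sketch that builds a small document with DOMDocument and lets the library do the escaping. The element and attribute names (note, body, title) and the sample strings are made up for the example.

```php
<?php
// Build the document through the DOM API instead of concatenating strings.
$doc = new DOMDocument('1.0', 'UTF-8');

$root = $doc->createElement('note');          // hypothetical element name
$doc->appendChild($root);

$body = $doc->createElement('body');          // hypothetical element name
// Text nodes and attribute values are passed in raw; the serializer escapes them.
$body->appendChild($doc->createTextNode('Tom & Jerry <3 "quotes"'));
$body->setAttribute('title', 'a "quoted" & <tagged> value');
$root->appendChild($body);

$xml = $doc->saveXML();
echo $xml;
```

Loading the result back into a parser should return the original strings unchanged, which is a quick way to check that nothing was double-escaped.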




Edit: This was criticized by @Tchalvak:

The DOM object creates a complete XML document; it does not easily lend itself to just encoding a string on its own.

That is wrong: a DOMDocument can perfectly well output just a fragment rather than the entire document:

$doc->saveXML($fragment); 

which gives:

    Test &amp; <b> and encode </b> :)
    Test &amp;amp; &lt;b&gt; and encode &lt;/b&gt; :)

as in:

    $doc = new DOMDocument();
    $fragment = $doc->createDocumentFragment();

    // adding XML verbatim:
    $xml = "Test &amp; <b> and encode </b> :)\n";
    $fragment->appendXML($xml);

    // adding text:
    $text = $xml;
    $fragment->appendChild($doc->createTextNode($text));

    // output the result
    echo $doc->saveXML($fragment);


+32
Oct 18 '10 at 8:00

I created a simple function that escapes the five "predefined entities" that are in XML:

    function xml_entities($string) {
        return strtr(
            $string,
            array(
                "<" => "&lt;",
                ">" => "&gt;",
                '"' => "&quot;",
                "'" => "&apos;",
                "&" => "&amp;",
            )
        );
    }

Demo usage example:

    $text = "Test &amp; <b> and encode </b> :)";
    echo xml_entities($text);

Output:

 Test &amp;amp; &lt;b&gt; and encode &lt;/b&gt; :) 

A similar effect can be achieved with str_replace, but it is fragile because of double replacements (untested, not recommended):

    function xml_entities($string) {
        return str_replace(
            array("&", "<", ">", '"', "'"),
            array("&amp;", "&lt;", "&gt;", "&quot;", "&apos;"),
            $string
        );
    }
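
To make the "double replacement" concern concrete, here is a small sketch. str_replace applies its pairs one after another, so the order of the search array matters, whereas strtr with an array scans the string once and never re-examines text it has already replaced. The variable names are only for illustration.

```php
<?php
$text = 'a < b';

// "&" is handled AFTER "<", so the "&" of the freshly inserted "&lt;" gets escaped again.
$wrong = str_replace(array('<', '&'), array('&lt;', '&amp;'), $text);

// "&" is handled first, so later replacements are left alone.
$right = str_replace(array('&', '<'), array('&amp;', '&lt;'), $text);

// strtr() with an array does a single pass; the order of the pairs does not matter.
$safe = strtr($text, array('<' => '&lt;', '&' => '&amp;'));

echo $wrong, "\n"; // a &amp;lt; b
echo $right, "\n"; // a &lt; b
echo $safe, "\n";  // a &lt; b
```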
+35
Oct 18 '10 at 8:26

What about the htmlspecialchars() function?

 htmlspecialchars($input, ENT_QUOTES | ENT_XML1, $encoding); 

Note: the ENT_XML1 flag is only available in PHP 5.4.0 or higher.

htmlspecialchars() with these parameters replaces the following characters:

  • & (ampersand) becomes &amp;
  • " (double quotation mark) becomes &quot;
  • ' (single quote) becomes &apos;
  • < (less than) becomes &lt;
  • > (greater than) becomes &gt;

You can get the translation table using the get_html_translation_table() function.
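
A short usage sketch of the call above, with a made-up input string. It escapes the value and then checks that decoding with the same flags returns the original, which is a simple way to confirm nothing was lost.

```php
<?php
$input = 'AT&T says "5 < 6" & it\'s > 4';   // sample value, chosen for illustration

// ENT_QUOTES escapes both quote styles; ENT_XML1 selects the XML 1.0 entity table.
$escaped = htmlspecialchars($input, ENT_QUOTES | ENT_XML1, 'UTF-8');
echo $escaped, "\n";

// Decoding with the same flags should give back the original string.
$roundTrip = htmlspecialchars_decode($escaped, ENT_QUOTES | ENT_XML1);
var_dump($roundTrip === $input);
```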

+15
Feb 15 '13 at 7:46

I struggled with the XML entity problem and solved it this way:

 htmlspecialchars($value, ENT_QUOTES, 'UTF-8') 
+11
Aug 04 2018-12-12T00:

To produce valid XML text, you need to escape all XML entities and write the text in the same encoding that the XML declaration states (the "encoding" attribute in the <?xml line). Accented characters do not need to be escaped if they are encoded in the document's encoding.

However, in many situations simply escaping the input with htmlspecialchars can lead to double-encoded entities (for example, &eacute; would become &amp;eacute;), so I suggest decoding HTML entities first:

    function xml_escape($s)
    {
        $s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
        $s = htmlspecialchars($s, ENT_QUOTES, 'UTF-8', false);
        return $s;
    }
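
The double-encoding effect can be seen step by step in the sketch below. The input string is invented, and it assumes the input really does contain HTML entities; text that was never entity-encoded would be altered by the decode step.

```php
<?php
$input = 'caf&eacute; &amp; bar';   // sample input that already contains HTML entities

// Escaping directly re-escapes the "&" of every existing entity.
$naive = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Decoding first turns the entities back into real characters ...
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// ... so that only the genuinely special characters get escaped afterwards.
$escaped = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');

echo $naive, "\n";
echo $escaped, "\n";
```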

Now you need to make sure that all accented characters are valid in the XML document's encoding. I highly recommend always encoding XML output as UTF-8, because not all XML parsers respect the encoding stated in the XML declaration. If your input may come from a different charset, try using utf8_encode().

There is a special case in which your input may come from one of these encodings: ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R. PHP treats them all the same, but there are slight differences between them, some of which even iconv() cannot handle. I could only solve this by complementing the behavior of utf8_encode():

    function encode_utf8($s)
    {
        // Maps the UTF-8 form of the C1 control characters (what utf8_encode()
        // emits for bytes 0x80-0x9F) to the characters cp1252 assigns to those bytes.
        $cp1252_map = array(
            "\xc2\x80" => "\xe2\x82\xac",
            "\xc2\x82" => "\xe2\x80\x9a",
            "\xc2\x83" => "\xc6\x92",
            "\xc2\x84" => "\xe2\x80\x9e",
            "\xc2\x85" => "\xe2\x80\xa6",
            "\xc2\x86" => "\xe2\x80\xa0",
            "\xc2\x87" => "\xe2\x80\xa1",
            "\xc2\x88" => "\xcb\x86",
            "\xc2\x89" => "\xe2\x80\xb0",
            "\xc2\x8a" => "\xc5\xa0",
            "\xc2\x8b" => "\xe2\x80\xb9",
            "\xc2\x8c" => "\xc5\x92",
            "\xc2\x8e" => "\xc5\xbd",
            "\xc2\x91" => "\xe2\x80\x98",
            "\xc2\x92" => "\xe2\x80\x99",
            "\xc2\x93" => "\xe2\x80\x9c",
            "\xc2\x94" => "\xe2\x80\x9d",
            "\xc2\x95" => "\xe2\x80\xa2",
            "\xc2\x96" => "\xe2\x80\x93",
            "\xc2\x97" => "\xe2\x80\x94",
            "\xc2\x98" => "\xcb\x9c",
            "\xc2\x99" => "\xe2\x84\xa2",
            "\xc2\x9a" => "\xc5\xa1",
            "\xc2\x9b" => "\xe2\x80\xba",
            "\xc2\x9c" => "\xc5\x93",
            "\xc2\x9e" => "\xc5\xbe",
            "\xc2\x9f" => "\xc5\xb8"
        );
        $s = strtr(utf8_encode($s), $cp1252_map);
        return $s;
    }
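
For context on why such a map is needed: utf8_encode() assumes ISO-8859-1, where the bytes 0x80 to 0x9F are control characters, while cp1252 assigns printable characters such as the euro sign to them. The sketch below shows the difference for a single byte. It swaps in mb_convert_encoding(), which is not what the answer uses, and assumes the mbstring extension is installed; utf8_encode() itself is deprecated as of PHP 8.2.

```php
<?php
$byte = "\x80";   // the euro sign in cp1252, a C1 control character in ISO-8859-1

// Interpreting the byte as ISO-8859-1 (what utf8_encode() does) yields U+0080.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Interpreting it as Windows-1252 yields U+20AC, the euro sign.
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c280
echo bin2hex($asCp1252), "\n"; // e282ac
```

These two results correspond to the first key and value of the $cp1252_map array.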
+4
Feb 21 '13 at 15:57

If you need proper XML output, SimpleXML is the way to go:

http://www.php.net/manual/en/simplexmlelement.asxml.php
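
A minimal sketch of what that could look like, with invented element and attribute names. One thing to watch: passing a value containing a bare "&" as the second argument of addChild() triggers an "unterminated entity reference" warning, so the sketch assigns the text through a property instead, which escapes it.

```php
<?php
$root = new SimpleXMLElement('<items/>');     // hypothetical root element

$item = $root->addChild('item');
// Property assignment escapes the value; addChild('name', $value) would not handle "&".
$item->name = 'Fish & Chips <large>';
$item->addAttribute('note', 'say "hi" & bye');

$xml = $root->asXML();
echo $xml;
```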

+2
Feb 21 '13 at 16:42

Proper escaping is the way to get correct XML output, but you need to handle escaping differently for attributes and elements. (That is, Thomas's answer is incorrect.)

I wrote/stole some Java code that differentiates between attribute escaping and element escaping. The reason is that the XML parser treats all whitespace as special, particularly in attributes.

It should be trivial to port this to PHP (you can use Thomas Janczyk's approach with the appropriate escaping described above). You do not need to worry about escaping extended entities if you use UTF-8.
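
The Java code itself is not shown here, so the following is only a guess at what a PHP port might look like, not the author's implementation. The function names xml_escape_text and xml_escape_attr are hypothetical. The attribute variant also turns tabs and line breaks into character references, because a parser otherwise normalizes that whitespace to plain spaces inside attribute values.

```php
<?php
// Hypothetical helper: escaping for element content.
function xml_escape_text(string $s): string
{
    return strtr($s, array(
        '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
        "\r" => '&#13;',          // a literal CR would be normalized to LF by the parser
    ));
}

// Hypothetical helper: escaping for attribute values (quotes and whitespace as well).
function xml_escape_attr(string $s): string
{
    return strtr($s, array(
        '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&apos;',
        "\t" => '&#9;', "\n" => '&#10;', "\r" => '&#13;',
    ));
}

$value = "line one\nline \"two\" & <three>";
$xml = '<e a="' . xml_escape_attr($value) . '">' . xml_escape_text($value) . '</e>';
echo $xml, "\n";
```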

If you don't want to port my Java code, you can look at XMLWriter, which is stream-based and uses libxml, so it should be very efficient.
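
A small sketch of the XMLWriter route, writing to memory with made-up element and attribute names. The writer escapes attribute values and text itself, so the raw strings are passed in as they are.

```php
<?php
$w = new XMLWriter();
$w->openMemory();                       // buffer the output in memory
$w->startDocument('1.0', 'UTF-8');

$w->startElement('item');               // hypothetical element name
$w->writeAttribute('label', 'a "b" & <c>');
$w->text('1 < 2 & 3 > 2');
$w->endElement();

$w->endDocument();
$xml = $w->outputMemory();
echo $xml;
```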

+1
Feb 21 '13 at 19:09

You can use the following function: http://php.net/manual/en/function.htmlentities.php

This way all entities (HTML/XML) are escaped and you can put your string inside XML tags.
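
One caveat that may be worth checking before relying on this: with its default entity table, htmlentities() also emits named HTML entities such as &eacute;, which XML does not predefine. The sketch below, using an invented sample value, feeds the result to an XML parser to see whether it is accepted.

```php
<?php
$value = "caf\xc3\xa9 & co";   // "café & co" as UTF-8 bytes

// With the default (HTML 4.01) table, the accented letter becomes a named HTML entity.
$escaped = htmlentities($value, ENT_QUOTES, 'UTF-8');

// Collect parser errors quietly instead of emitting warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$ok = $doc->loadXML('<name>' . $escaped . '</name>');

var_dump($escaped); // caf&eacute; &amp; co
var_dump($ok);      // false: the eacute entity is not declared in XML
```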

0
Oct 18 '10 at 7:59
    function replace_char($arr1) {
        // Escape "&" first, so the ampersands of the entities added below are not escaped again.
        $arr1 = preg_replace('/&/', '&amp;', $arr1);
        $arr1 = preg_replace('/>/', '&gt;', $arr1);
        $arr1 = preg_replace('/</', '&lt;', $arr1);
        $arr1 = preg_replace('/"/', '&quot;', $arr1);
        $arr1 = preg_replace('/\'/', '&apos;', $arr1);
        return $arr1;
    }
0
May 29 '13 at 11:35

Based on the sadeghj solution, the following code worked for me:

    /**
     * @param $arr1 the single string that shall be masked
     * @return the resulting string with the masked characters
     */
    function replace_char($arr1) {
        if (strpos($arr1, '&') !== FALSE) { // test if the character appears
            $arr1 = preg_replace('/&/', '&amp;', $arr1); // do this first
        }

        // just encode the remaining characters
        if (strpos($arr1, '>') !== FALSE) {
            $arr1 = preg_replace('/>/', '&gt;', $arr1);
        }
        if (strpos($arr1, '<') !== FALSE) {
            $arr1 = preg_replace('/</', '&lt;', $arr1);
        }
        if (strpos($arr1, '"') !== FALSE) {
            $arr1 = preg_replace('/"/', '&quot;', $arr1);
        }
        if (strpos($arr1, '\'') !== FALSE) {
            $arr1 = preg_replace('/\'/', '&apos;', $arr1);
        }

        return $arr1;
    }
0
Mar 05 '14 at 17:10