Does PHP have no function for XML-safe entity decoding? Is there no xml_entity_decode?

PROBLEM: I need an XML file that is fully UTF-8 encoded, that is, with no entities standing in for characters: every character should be a literal UTF-8 character, with the exception of the three reserved by XML, "&" (amp), "<" (lt) and ">" (gt). I need a built-in function that does this quickly: convert entities to real UTF-8 characters without corrupting my XML.
PS: this is a real-world problem (!); at PMC/journals, for example, there are 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "regular XML-UTF8 text" we need to convert numeric entities to UTF-8 characters.

FIRST SOLUTION: The natural function for this task is html_entity_decode, but it destroys the XML code (!) by also converting the three reserved XML characters.

Illustrating the problem

Suppose that

  $xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

The entities 160 (nbsp) and x222C (double integral) must be converted to UTF-8, but the XML-reserved lt must not. After conversion, the XML text should be (the two non-breaking spaces are now literal characters)

  $xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';

The text "A<B" contains an XML-reserved character, so it MUST remain as A&lt;B.


Failed attempts

I tried to use html_entity_decode to solve the problem directly, so I updated my PHP to version 5.5 in order to try the ENT_XML1 flag.

  $s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // does not work as I expected
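To make the failure concrete, here is a minimal sketch that runs the call above on the fragment from the question. It only illustrates the behavior described; the variable name $decoded is an assumption introduced for the example.

```php
<?php
// The fragment from the question: two &#160;, one &lt;, one &#x222C;.
$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';

// $decoded is a hypothetical name used only in this sketch.
$decoded = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8');

// The numeric entities become real UTF-8 characters, which is wanted...
var_dump(strpos($decoded, '∬') !== false);   // bool(true)

// ...but &lt; is decoded too, leaving a bare "<" inside the text.
var_dump(strpos($decoded, 'A<B') !== false); // bool(true)
```

ENT_XML1 restricts which named entities are recognized, but the reserved ones are among those it decodes, so the output is no longer safe to parse as XML.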

Perhaps the real question is: "WHY is there no built-in way to do what I expected?" This matters for many other XML applications (!), not just for me.


I don't need a workaround as an answer... Still, here is my ugly function; maybe it will help you understand the problem:

  function xml_entity_decode($s) {
      // An illustration (by user-defined function) of how the
      // hypothetical PHP built-in function MUST work.
      static $XENTITIES = array('&amp;', '&gt;', '&lt;');
      static $XSAFENTITIES = array('#_x_amp#;', '#_x_gt#;', '#_x_lt#;');
      $s = str_replace($XENTITIES, $XSAFENTITIES, $s);
      //$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any PHP version
      $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
      $s = str_replace($XSAFENTITIES, $XENTITIES, $s);
      return $s;
  }
  // You see? No benchmark is needed: this cannot be as fast as a direct
  // call to html_entity_decode; an XML-safe option would be ideal.

PS: corrected after this answer. The ENT_HTML5 flag is needed to convert truly all named entities.

5 answers

This question attracts a "false answer" from time to time (see the other answers). Perhaps this is because people do not read carefully, and because the ANSWER IS NO: there is no built-in PHP solution.

So I repeat my workaround here (it is NOT the answer!) to avoid further confusion:

The best workaround

Note:

  • The xml_entity_decode() function below is the best workaround (compared to any other).
  • The function below is not the answer to this question; it is just a workaround.

  function xml_entity_decode($s) {
      // Illustrates how a (hypothetical) PHP built-in function MUST work.
      static $XENTITIES = array('&amp;', '&gt;', '&lt;');
      static $XSAFENTITIES = array('#_x_amp#;', '#_x_gt#;', '#_x_lt#;');
      $s = str_replace($XENTITIES, $XSAFENTITIES, $s);
      $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
      $s = str_replace($XSAFENTITIES, $XENTITIES, $s);
      return $s;
  }
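As a quick check, the workaround can be run on the fragment from the question. This is only a usage sketch: the function body is the one shown in this thread, and $out is a name introduced for the example.

```php
<?php
// The workaround from this thread, repeated so the sketch is self-contained.
function xml_entity_decode($s) {
    static $XENTITIES = array('&amp;', '&gt;', '&lt;');
    static $XSAFENTITIES = array('#_x_amp#;', '#_x_gt#;', '#_x_lt#;');
    $s = str_replace($XENTITIES, $XSAFENTITIES, $s);              // hide reserved entities
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // decode everything else
    $s = str_replace($XSAFENTITIES, $XENTITIES, $s);              // restore reserved entities
    return $s;
}

$xmlFrag = '<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';
$out = xml_entity_decode($xmlFrag); // $out is a hypothetical name

// &lt; survives, while the numeric entities are now literal UTF-8 characters.
var_dump(strpos($out, 'A&lt;B') !== false); // bool(true)
var_dump(strpos($out, '&#') === false);     // bool(true)
```

Numeric references to the reserved characters themselves (for example &#60;) are not protected by the placeholder step, so they would still be decoded.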

To test and demonstrate that you have a better solution, first try this simple benchmark:

  $countBchMk_MAX = 1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for ($countBchMk = 0; $countBchMk < $countBchMk_MAX; $countBchMk++) {
      $A = xml_entity_decode($xml); // 0.0002
      /* 0.0014
      $doc = new DOMDocument;
      $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
      $doc->encoding = 'UTF-8';
      $A = $doc->saveXML();
      */
  }
  $end_time = microtime(TRUE);
  echo "\n<h1>END $countBchMk_MAX BENCHMARKS WITH ",
      ($end_time - $start_time) / $countBchMk_MAX,
      " seconds</h1>";

Use the DTD when loading the JATS XML document, since it defines the mapping from named entities to Unicode characters, and then set the encoding to UTF-8 when saving:

  $doc = new DOMDocument;
  $doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
  $doc->encoding = 'UTF-8';
  $doc->save($outputFile);
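The same idea can be tried without a JATS file. The sketch below is an assumption-laden stand-in: a tiny internal DTD subset (with hypothetical nbsp and euro declarations) plays the role of the JATS DTD, and loadXML()/saveXML() replace the file-based load()/save().

```php
<?php
// Hypothetical sample: the internal DTD subset stands in for the JATS DTD.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY nbsp "&#160;">
  <!ENTITY euro "&#8364;">
]>
<p>Price: 5&nbsp;&euro;, A&lt;B, &#x222C;dxdy</p>
XML;

$doc = new DOMDocument();
// LIBXML_NOENT substitutes entity references using the DTD declarations.
$doc->loadXML($xml, LIBXML_NOENT);
// With an explicit encoding, saveXML() writes non-ASCII characters as raw UTF-8.
$doc->encoding = 'UTF-8';
$out = $doc->saveXML();

echo $out;
```

Because the serializer re-escapes markup characters, A&lt;B stays escaped while the named and numeric entities come out as literal characters. LIBXML_NOENT should be used with care on untrusted input, since entity substitution is the mechanism behind XXE attacks.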
Another variant is a framework-style entity_decode() method. It depends on $this->charset and an is_php() helper that are defined elsewhere in its class, and because it passes ENT_COMPAT to html_entity_decode it also decodes &amp;, &lt; and &gt;:

  public function entity_decode($str, $charset = NULL)
  {
      if (strpos($str, '&') === FALSE) {
          return $str;
      }

      static $_entities;

      isset($charset) OR $charset = $this->charset;
      $flag = is_php('5.4') ? ENT_COMPAT | ENT_HTML5 : ENT_COMPAT;

      do {
          $str_compare = $str;

          // Decode standard entities, avoiding false positives
          if (preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches)) {
              if ( ! isset($_entities)) {
                  $_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));

                  // If we're not on PHP 5.4+, add the possibly dangerous HTML 5
                  // entities to the array manually
                  if ($flag === ENT_COMPAT) {
                      $_entities[':'] = '&colon;';
                      $_entities['('] = '&lpar;';
                      $_entities[')'] = '&rpar;';
                      $_entities["\n"] = '&newline;';
                      $_entities["\t"] = '&tab;';
                  }
              }

              $replace = array();
              // array_unique() preserves keys, so iterate with foreach
              // instead of indexing from 0 to the match count.
              foreach (array_unique(array_map('strtolower', $matches[0])) as $match) {
                  if (($char = array_search($match.';', $_entities, TRUE)) !== FALSE) {
                      $replace[$match] = $char;
                  }
              }

              $str = str_ireplace(array_keys($replace), array_values($replace), $str);
          }

          // Decode numeric & UTF16 two byte entities
          $str = html_entity_decode(
              preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
              $flag,
              $charset
          );
      } while ($str_compare !== $str);

      return $str;
  }

I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. *sigh*... Anyway, I came up with the following. It is not as fast as yours, but it is not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert a literal #_x_amp#; into &amp;, although such a string is unlikely to appear in the source XML.

Note: I assume the default encoding is UTF-8.

  // Search for named entities (strings like "&abc1;").
  echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
      // Decode the entity and re-encode as XML entities. This means "&amp;"
      // will remain "&amp;" whereas "&euro;" becomes "€".
      return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
  }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

  /* <Foo>€&amp;foo Ç</Foo> */

In addition, if you want to replace special characters with numeric entities (that is, if you do not need UTF-8 XML), you can easily add one more function call to the code above:

  // Search for named entities (strings like "&abc1;").
  $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
      // Decode the entity and re-encode as XML entities. This means "&amp;"
      // will remain "&amp;" whereas "&euro;" becomes "€".
      return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
  }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

  // Encode the non-ASCII characters as numeric entities.
  echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

  /* <Foo>&#8364;&amp;foo &#199;</Foo> */

In your case you want the reverse: decode numeric entities to UTF-8:

  // Search for named entities (strings like "&abc1;").
  $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
      // Decode the entity and re-encode as XML entities. This means "&amp;"
      // will remain "&amp;" whereas "&euro;" becomes "€".
      return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
  }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

  // Decode the remaining numeric entities to UTF-8.
  echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

  /* <Foo>€&amp;foo Ç</Foo> */
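The two steps above (named entities by callback, numeric entities by mbstring) can also be folded into a single pass. The following is a sketch of that variation, not the code from this answer: the function name xml_entity_decode_regex is hypothetical, and it assumes that any entity decoding to a markup character should be left untouched.

```php
<?php
// Hypothetical single-pass variant: one regex matches named, decimal and
// hexadecimal entities; the callback decides which ones may be decoded.
function xml_entity_decode_regex(string $xml): string
{
    return preg_replace_callback(
        '/&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
        static function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Keep the entity if it is unknown (unchanged) or if its
            // expansion contains a character that is markup in XML.
            if ($decoded === $m[0] || strpbrk($decoded, '&<>"\'') !== false) {
                return $m[0];
            }
            return $decoded;
        },
        $xml
    );
}

$in  = '<Foo>&euro;&amp;foo &Ccedil; #_x_amp#; &#8748; &#60; &bogus;</Foo>';
$out = xml_entity_decode_regex($in);
echo $out, "\n"; // <Foo>€&amp;foo Ç #_x_amp#; ∬ &#60; &bogus;</Foo>
```

This avoids placeholders entirely and also leaves numeric references to reserved characters (such as &#60;) alone, at the cost of one callback per entity.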

Benchmark

I added a benchmark for good measure. It also demonstrates the flaw in your solution. The following is the input string.

 <Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo> 

Your method

  php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

  <Foo>€&amp;foo Ç é &amp; ∬</Foo>
  =====
  Time taken: 2.0397531986237
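The literal #_x_amp#; turning into &amp; is a placeholder collision. One way to sidestep it, sketched here as an assumption rather than as anyone's posted method, is to build the placeholders from control characters such as U+0001 and U+0002, which XML 1.0 does not allow in a document at all. The function name xml_entity_decode_ctrl is hypothetical.

```php
<?php
// Hypothetical variant of the placeholder workaround. The placeholders use
// U+0001/U+0002, which cannot occur in a well-formed XML 1.0 document,
// so they cannot collide with real content.
function xml_entity_decode_ctrl(string $xml): string
{
    $protect = array(
        '&amp;' => "\x01amp\x02",
        '&lt;'  => "\x01lt\x02",
        '&gt;'  => "\x01gt\x02",
    );
    $xml = strtr($xml, $protect);                                     // hide reserved entities
    $xml = html_entity_decode($xml, ENT_HTML5 | ENT_NOQUOTES, 'UTF-8'); // decode the rest
    return strtr($xml, array_flip($protect));                         // restore them
}

$in  = '<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>';
$out = xml_entity_decode_ctrl($in);
echo $out, "\n"; // <Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
```

Like the original workaround, this still decodes numeric references to reserved characters (for example &#38;), so it is only safe when the input does not contain them.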

My method

  php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

  <Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
  =====
  Time taken: 4.045273065567

My method (Unicode to numeric entities):

  php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

  <Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
  =====
  Time taken: 5.4407880306244

My method (numeric entities to Unicode):

  php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

  <Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
  =====
  Time taken: 5.5400078296661

Try this function:

  function xmlsafe($s, $intoQuotes = 1)
  {
      if ($intoQuotes) {
          return str_replace(array('&', '>', '<', '"'), array('&amp;', '&gt;', '&lt;', '&quot;'), $s);
      } else {
          return str_replace(array('&', '>', '<'), array('&amp;', '&gt;', '&lt;'), html_entity_decode($s));
      }
  }

usage example:

 echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>'; 
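Note that the default branch of xmlsafe() escapes text for use inside an attribute; it does not decode entities. For that escaping job PHP already has a built-in, sketched below with a made-up $description value; this is offered as a comparison, not as what the answer's author uses.

```php
<?php
// Hypothetical sample value for the attribute.
$description = 'Tom & Jerry say "A<B" and it\'s fine';

// htmlspecialchars() with ENT_XML1 escapes only the XML markup characters;
// ENT_QUOTES makes it cover both double and single quotes.
$escaped = htmlspecialchars($description, ENT_XML1 | ENT_QUOTES, 'UTF-8');

echo '<k description="' . $escaped . '"/>', "\n";
// <k description="Tom &amp; Jerry say &quot;A&lt;B&quot; and it&apos;s fine"/>
```

As with the str_replace() version, an input that already contains entities is escaped again (&amp; becomes &amp;amp;), so this is for raw text, not for text that is already XML.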

also: fooobar.com/questions/184356 / ...

This code is used in production and seems to have no problems with UTF-8.


Source: https://habr.com/ru/post/1495160/

