I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it's less hacky. Yours will inadvertently convert #_x_amp#; into &amp;; however, that string is unlikely to appear in the source XML.
Note: I assume the default encoding is UTF-8.
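If you would rather not depend on the default encoding, a variant that passes it explicitly might look like the sketch below. The function name `htmlEntitiesToXml` and its `$encoding` parameter are made up for illustration. The flags are my own assumption too: I use `ENT_QUOTES` on both sides so that `&quot;` round-trips, `ENT_HTML5` for decoding, and `htmlspecialchars` in place of `htmlentities`, since only the XML special characters need escaping.

```php
<?php
// Hypothetical helper: converts named HTML entities to UTF-8 characters,
// keeping only the entities XML itself understands. The encoding is passed
// explicitly instead of relying on PHP's default_charset setting.
function htmlEntitiesToXml(string $xml, string $encoding = 'UTF-8'): string
{
    $result = preg_replace_callback(
        '#&[A-Z0-9]+;#i',
        function (array $m) use ($encoding): string {
            // Decode one named entity into a raw character...
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, $encoding);
            // ...and re-escape it only if XML requires it (&, <, >, quotes).
            return htmlspecialchars($char, ENT_QUOTES | ENT_XML1, $encoding);
        },
        $xml
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $xml;
}

echo htmlEntitiesToXml('<Foo>&euro;&amp;foo &Ccedil;</Foo>'), "\n";
```

For UTF-8 input this should behave like relying on the defaults; the explicit argument only matters if `default_charset` has been changed.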
In addition, if you want to replace special characters with numeric entities (in case you don't want UTF-8 XML), you can easily add one function call to the code:
    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

    // Encode the remaining non-ASCII characters as numeric entities.
    echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

    /* <Foo>&#8364;&amp;foo &#199;</Foo> */
In your case, you want it the other way around, i.e. numeric entities decoded to UTF-8:
    // Search for named entities (strings like "&abc1;").
    $xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
        // Decode the entity and re-encode as XML entities. This means "&amp;"
        // will remain "&amp;" whereas "&euro;" becomes "€".
        return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
    }, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

    // Decode any (uncaught) numeric entities to UTF-8.
    echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

    /* <Foo>€&amp;foo Ç</Foo> */
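A quick way to convince yourself that the conversion does its job is to feed both strings to SimpleXML. This is only a sanity-check sketch: it assumes the SimpleXML extension is enabled, and the variable names `$html_style` and `$doc` are mine.

```php
<?php
// Collect libxml errors internally so a parse failure just returns false.
libxml_use_internal_errors(true);

$html_style = '<Foo>&euro;&amp;foo &Ccedil;</Foo>';

// Same conversion as above: named HTML entities become UTF-8 or XML entities.
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($m) {
    return htmlentities(html_entity_decode($m[0]), ENT_XML1);
}, $html_style);

// &euro; and &Ccedil; are not defined in plain XML, so the raw string
// should be rejected by the parser.
var_dump(simplexml_load_string($html_style) === false);

// The converted string should load, with the text available as UTF-8.
$doc = simplexml_load_string($xml_utf8);
var_dump($doc !== false);
echo (string) $doc, "\n";
```

If the second `var_dump` ever prints `bool(false)`, `libxml_get_errors()` will tell you which entity slipped through.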
Benchmark
I've added a benchmark for good measure. It also demonstrates the flaw in your solution. Below is the input string:
    <Foo>&euro;&amp;foo &Ccedil; é #_x_amp#; ∬</Foo>
Your method
    php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; é #_x_amp#; ∬</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>€&amp;foo Ç é &amp; ∬</Foo>
    =====
    Time taken: 2.0397531986237
My method
    php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; é #_x_amp#; ∬</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>€&amp;foo Ç é
My method (with Unicode to numeric entities):
    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; é #_x_amp#; ∬</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>&
My method (with numeric entities to Unicode):
    php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; é #_x_amp#;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
    <Foo>€&amp;foo Ç é