Handling surrogate pairs when parsing XML with libxml2

I am trying to parse XML using libxml2. However, the input sometimes contains character references to surrogate code points, which are outside the range allowed by http://www.w3.org/TR/REC-xml/#NT-Char
Because of this, libxml2 cannot parse the document and reports an error. Can someone tell me how to handle surrogate pairs when parsing XML with libxml2?

An example of the XML I want to parse:

<?xml version="1.0" encoding="UTF-8"?> <message><body> &#xD83D;&#xD83D;</body></message> 
2 answers

Note that 0xD83D is a high surrogate. A surrogate pair consists of a high surrogate followed by a low surrogate; two high surrogates next to each other do not form a surrogate pair and are meaningless.

Also note that the correct way to represent a non-BMP character in XML is a single character reference to the combined code point, for example &#x120AB; . Splitting a non-BMP character into two surrogates is necessary in some character encodings (UTF-16), but it is neither needed nor allowed in XML character references. Character references in XML are Unicode code points, not numeric values specific to a particular character encoding.
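To make the relationship between a surrogate pair and its code point concrete, here is a minimal sketch in Python. The function name `combine_surrogates` is made up for illustration; the arithmetic is the standard UTF-16 decoding rule.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair D808 DCAB is the UTF-16 form of U+120AB.
print("&#x%X;" % combine_surrogates(0xD808, 0xDCAB))  # prints &#x120AB;
```

Calling it with two high surrogates, as in the question's input, raises `ValueError`, because that sequence does not encode any character.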

If you cannot fix the program that produced this bad XML, the next best solution is to repair the input with a script (for example, in Perl) that looks for pairs of surrogate character references and replaces each pair with the correct single XML character reference.
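A repair pass of that kind might look roughly like the sketch below. It is written in Python rather than Perl, the name `repair_surrogate_refs` is hypothetical, and it assumes the bad references are always hexadecimal (`&#x...;`); decimal references would need an extra pattern.

```python
import re

# A hex reference to a high surrogate (D800-DBFF) immediately followed by
# a hex reference to a low surrogate (DC00-DFFF). Leading zeros are allowed.
_SURROGATE_PAIR = re.compile(
    r"&#x0*(d[89ab][0-9a-f]{2});&#x0*(d[c-f][0-9a-f]{2});",
    re.IGNORECASE,
)


def repair_surrogate_refs(xml_text):
    """Replace each surrogate-pair reference with one code point reference."""
    def fix(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return "&#x%X;" % code_point

    return _SURROGATE_PAIR.sub(fix, xml_text)


print(repair_surrogate_refs("<body>&#xD83D;&#xDE00;</body>"))
# prints <body>&#x1F600;</body>
```

The sketch leaves unpaired surrogates untouched, so input such as `&#xD83D;&#xD83D;` would still be rejected by the parser. Those references carry no recoverable character and would have to be dropped or replaced separately before parsing.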


If the XML standard does not allow these characters, the parser will report an error. One way to include these references in XML is to place them inside a CDATA section. CDATA sections are used to escape blocks of text containing characters that would otherwise be recognized as markup.

 <message><body> <![CDATA[&#xD83D;&#xD83D;&#xD83D;]]></body></message> 

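The XML above parses without error. Note, however, that character references are not expanded inside a CDATA section, so the body contains the literal text &#xD83D; rather than the characters.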
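A quick way to see what a parser hands back for a CDATA section is a sketch like the following. It uses Python's standard `xml.etree.ElementTree` rather than libxml2, on the assumption that any conforming parser treats CDATA content the same way; the helper name `body_text` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def body_text(document):
    """Return the text content of the <body> element."""
    return ET.fromstring(document).find("body").text


doc = "<message><body><![CDATA[&#xD83D;&#xD83D;]]></body></message>"

# The references come back as plain text; no character is decoded from them.
print(repr(body_text(doc)))
```

So CDATA avoids the parse error, but the application still receives the undecoded reference text and has to interpret it itself.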


Source: https://habr.com/ru/post/1274565/

