StringEscapeUtils.escapeXml converts UTF-8 characters that it should not

The escapeXml function converts ѭ Ѯ to &#1133; &#1134;, which I think it should not. I read that it supports only the five basic XML entities (gt, lt, quot, amp, apos).

Is there a function that converts only these five basic XML entities?

4 answers
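    public String escapeXml(String s) {
        // Escape '&' first so the entities added below are not escaped again.
        // String.replace is literal, so no regex is needed here.
        return s.replace("&", "&amp;")
                .replace(">", "&gt;")
                .replace("<", "&lt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }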
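The order of the chained replacements in this answer matters: the ampersand has to be handled first. A minimal sketch of why, using only the JDK (the class and method names are hypothetical and chosen for illustration):

```java
final class EscapeOrderDemo {

    // Ampersand first: the '&' characters introduced by later replacements are left alone.
    static String escapeAmpFirst(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Ampersand last (wrong): the '&' inside "&lt;" and "&gt;" gets escaped a second time.
    static String escapeAmpLast(String s) {
        return s.replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String input = "a < b & c";
        System.out.println(escapeAmpFirst(input)); // a &lt; b &amp; c
        System.out.println(escapeAmpLast(input));  // a &amp;lt; b &amp; c
    }
}
```

Characters above 0x7f are never touched by either variant, which is the behaviour the question asks for.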

The Javadoc for version 3.1 says:

Note that Unicode characters greater than 0x7f are, as of 3.0, no longer escaped. If you still want this functionality, you can achieve it with the following: StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );

So you are probably using an older version of the library. Update your dependencies (or redefine the escape yourself; it's not rocket science).


The documentation for StringEscapeUtils.escapeXml says we should use

 StringEscapeUtils.ESCAPE_XML.with( new UnicodeEscaper(Range.between(0x7f, Integer.MAX_VALUE)) ); 

But instead of UnicodeEscaper you need to use NumericEntityEscaper. UnicodeEscaper turns everything into the \u1234 form, whereas NumericEntityEscaper produces &#123;, which is what was expected.
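To make the difference between the two output forms concrete, here is a small sketch in plain Java. It does not call Commons Lang; the helper names are hypothetical and only mimic the two shapes described above, so details such as hex casing in the real library are an assumption.

```java
final class EscapeFormsDemo {

    // Java-style escape: backslash, 'u', four hex digits per UTF-16 code unit.
    // This is the shape the answer attributes to UnicodeEscaper.
    static String toJavaUnicodeEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // XML numeric character reference: ampersand, hash, decimal code point, semicolon.
    // This is the shape the answer attributes to NumericEntityEscaper.
    static String toNumericEntity(int codePoint) {
        return "&#" + codePoint + ";";
    }

    public static void main(String[] args) {
        int codePoint = 0x046D; // first character from the question, decimal 1133
        System.out.println(toJavaUnicodeEscape(codePoint)); // Java-style escape, not valid XML
        System.out.println(toNumericEntity(codePoint));     // &#1133;
    }
}
```

Only the second form is understood by an XML parser, which is why the numeric entity variant is the one to combine with ESCAPE_XML.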

    package mypackage;

    import org.apache.commons.lang3.StringEscapeUtils;
    import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
    import org.apache.commons.lang3.text.translate.NumericEntityEscaper;

    public class XmlEscaper {

        public static void main(final String[] args) {
            final String xmlToEscape = "<hello>Hi</hello>" + "_ _" + "__ __" + "___ ___" + "after &nbsp;"; // the line cont

            // no Unicode escape
            final String escapedXml = StringEscapeUtils.escapeXml(xmlToEscape);

            // escape Unicode as numeric codes. For instance, escape non-breaking space as &#160;
            final CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_XML.with(
                    NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));
            final String escapedXmlWithUnicode = translator.translate(xmlToEscape);

            System.out.println("xmlToEscape: " + xmlToEscape);
            System.out.println("escapedXml: " + escapedXml); // does not escape Unicode characters like non-breaking space
            System.out.println("escapedXml with unicode: " + escapedXmlWithUnicode); // escapes Unicode characters
        }
    }

Now that XML documents are usually UTF-8, keeping non-ASCII characters readable is sometimes preferable. This should work, and the string is rebuilt only once.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static final Pattern ESCAPE_XML_CHARS = Pattern.compile("[\"&'<>]");

    public static String escapeXml(String s) {
        Matcher m = ESCAPE_XML_CHARS.matcher(s);
        StringBuffer buf = new StringBuffer();
        while (m.find()) {
            // Each match is exactly one of the five characters; pick its entity.
            switch (m.group().codePointAt(0)) {
                case '"':  m.appendReplacement(buf, "&quot;"); break;
                case '&':  m.appendReplacement(buf, "&amp;");  break;
                case '\'': m.appendReplacement(buf, "&apos;"); break;
                case '<':  m.appendReplacement(buf, "&lt;");   break;
                case '>':  m.appendReplacement(buf, "&gt;");   break;
            }
        }
        m.appendTail(buf);
        return buf.toString();
    }

Source: https://habr.com/ru/post/1392648/

