The workaround below does not involve any charset other than the one specified in the HTTP header.
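import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;

// Mask every entity (&name; becomes **name;) so the parser leaves it alone.
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");
Document doc = Jsoup.parse(check);
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
// Restore the masked entities in the serialized HTML.
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));
OUTPUT
<html><head></head><body><p>isn&rsquo;t</p></body></html>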
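To see what the two replaceAll calls do on their own, here is a minimal sketch that runs them without Jsoup. The class and method names (EntityMasking, mask, unmask) are hypothetical and only wrap the regular expressions from the snippet above; the parsing step would sit between the two calls.

```java
// Illustrative sketch only: EntityMasking, mask and unmask are made-up names.
class EntityMasking {
    // Turn "&name;" into "**name;" so an HTML parser sees plain text.
    static String mask(String html) {
        return html.replaceAll("&([^;]+?);", "**$1;");
    }

    // Turn "**name;" back into "&name;" after the document is serialized.
    static String unmask(String html) {
        return html.replaceAll("\\*\\*([^;]+?);", "&$1;");
    }

    public static void main(String[] args) {
        String html = "<p>isn&rsquo;t</p>";
        String masked = mask(html);
        System.out.println(masked);         // <p>isn**rsquo;t</p>
        System.out.println(unmask(masked)); // <p>isn&rsquo;t</p>
    }
}
```

One caveat with this approach: text that already contains something like **foo; comes back as &foo;, so the ** placeholder is only safe if it cannot occur in your input.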
DISCUSSION
I wish the Jsoup API had a solution for this - @dlv
Using the Jsoup API would require you to write a custom NodeVisitor, which means recreating some code that already exists inside Jsoup. The custom NodeVisitor would emit the HTML escape sequence instead of the Unicode character.
Another option would be to write a custom character encoder. The default UTF-8 character encoder can encode ’, which is why Jsoup does not preserve the original escape sequence in the final HTML code.
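The encoder point can be checked with the JDK alone. This is a small sketch, not Jsoup's own code: the class name EncoderCheck is hypothetical, and it only asks the standard CharsetEncoder whether ’ (U+2019) is representable in a given charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: EncoderCheck is a made-up name, not part of Jsoup.
class EncoderCheck {
    // True when the charset can represent the character directly.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // right single quotation mark
        System.out.println("UTF-8:    " + canEncode(StandardCharsets.UTF_8, rsquo));    // true
        System.out.println("US-ASCII: " + canEncode(StandardCharsets.US_ASCII, rsquo)); // false
    }
}
```

Since UTF-8 can represent the character, a serializer that only escapes what the output charset cannot encode has no reason to write an escape sequence for it. As far as I know, the output charset in Jsoup's Document.OutputSettings is what drives that decision, but verify this against the Javadoc of your Jsoup version.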
Either option takes a lot of coding effort. Ultimately, an enhancement could be added to Jsoup that lets us choose how characters are generated in the final HTML code: hexadecimal escape (&#xAB;), decimal escape (&#8212;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
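To make the idea concrete, here is a rough sketch of what such a choice could look like. Everything in it (EscapeStyles, Style, render) is hypothetical and does not exist in Jsoup. It covers only the hexadecimal, decimal, and raw styles; reproducing the original escape sequence would need the parser to remember what the input contained, which this sketch does not attempt.

```java
// Hypothetical sketch: none of these names exist in Jsoup.
class EscapeStyles {
    enum Style { HEX, DECIMAL, RAW }

    // Renders every non-ASCII code point in the chosen style; ASCII passes through.
    static String render(String text, Style style) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128 || style == Style.RAW) {
                out.appendCodePoint(cp);
            } else if (style == Style.HEX) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "isn\u2019t";
        for (Style style : Style.values()) {
            System.out.println(style + ": " + render(text, style));
        }
    }
}
```

For the sample text this prints isn&#x2019;t for HEX, isn&#8217;t for DECIMAL, and the character unchanged for RAW.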