Jsoup unescapes special characters

I use Jsoup to remove all images from an HTML page. I get the page through an HTTP response that also contains the encoding of the content.

The problem is that Jsoup unescapes some special characters.

For example, given this input:

<html><head></head><body><p>isn&rsquo;t</p></body></html> 

After running:

 String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
 Document doc = Jsoup.parse(check);
 System.out.println(doc.outerHtml());

I get:

 <html><head></head><body><p>isn’t</p></body></html>

I want to avoid changing the HTML in any way other than removing the images.

With this setting:

 doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended); 

I get the correct output, but I am sure there are cases where this encoding will not be suitable. I just want to use the encoding specified in the HTTP header, and I am afraid this setting will change my document in ways I cannot predict. Is there another way to remove the images without any unintended changes?

Thanks!

1 answer

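Here is a workaround that does not depend on any encoding other than the one specified in the HTTP header. It replaces the & of each entity with a placeholder before parsing, so Jsoup never sees an entity to decode, and restores it after serialization.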

 // Mask every entity (&name;) as **name; so that Jsoup does not decode it.
 String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>"
         .replaceAll("&([^;]+?);", "**$1;");
 Document doc = Jsoup.parse(check);
 doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
 // Restore the original entities in the serialized HTML.
 System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));

OUTPUT

 <html><head></head><body><p>isn&rsquo;t</p></body></html> 
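The placeholder trick above can be wrapped in a small helper. The sketch below is only an illustration and uses nothing but the JDK: the class name `EntityMask` and the `@@ENT:` marker are hypothetical, and the marker is assumed never to occur in the real page. It also uses a stricter pattern than `&([^;]+?);`, so that a bare ampersand followed later by a semicolon (as in "fish & chips; ok") is left alone.

```java
import java.util.regex.Pattern;

final class EntityMask {
    // Matches named (&rsquo;), decimal (&#8217;) and hex (&#x2019;) references only.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Matches the hypothetical marker written by mask().
    private static final Pattern MASKED =
            Pattern.compile("@@ENT:(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Replace the leading '&' of every entity with the marker.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll("@@ENT:$1;");
    }

    // Turn every marker back into the original entity.
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String html = "<p>isn&rsquo;t &#8217; &#x2019; fish & chips; ok</p>";
        String masked = mask(html);
        System.out.println(masked);
        // The HTML parser would run here, between mask() and unmask().
        System.out.println(unmask(masked));
    }
}
```

In real use, the string returned by `mask` would be passed to `Jsoup.parse`, and `unmask` would be applied to the result of `doc.outerHtml()`.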

DISCUSSION

I would like a solution that uses the Jsoup API - @dlv

Using the Jsoup API would require you to write a custom NodeVisitor, which means reimplementing some code that already exists inside Jsoup. The custom NodeVisitor would write the HTML escape sequence back out instead of the Unicode character.

Another option would be to write a custom character encoder. The UTF-8 encoder can encode the character that &rsquo; stands for (U+2019) by default. This is why Jsoup does not keep the original escape sequence in the final HTML.

Either of the two options above is a significant coding effort. Ultimately, an enhancement could be added to Jsoup that lets us choose how characters are written in the final HTML: as a hexadecimal escape ( &#xAB; ), a decimal escape ( &#151; ), the original escape sequence ( &rsquo; ), or the encoded character itself (which is what happens in your post).
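To see the difference between charsets, one can ask the JDK encoders directly. This is a minimal sketch using only `java.nio.charset`; the class name `CanEncodeDemo` is hypothetical, and the assumption that Jsoup makes a similar per-character check on the output charset is my reading of its behavior, not something confirmed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CanEncodeDemo {
    // True if the charset has a byte representation for the character.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // the character that &rsquo; stands for
        System.out.println("UTF-8:    " + canEncode(StandardCharsets.UTF_8, rsquo));
        System.out.println("US-ASCII: " + canEncode(StandardCharsets.US_ASCII, rsquo));
    }
}
```

UTF-8 can represent U+2019 and US-ASCII cannot, which is consistent with the question: setting the output charset to ASCII forces an escape, while the default output writes the character itself.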
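The output forms listed above can be illustrated with plain JDK code. This is only a sketch of what such a choice might produce, not part of Jsoup: the class `EscapeForms`, its method names, and the two-entry name table are all hypothetical.

```java
import java.util.Map;

final class EscapeForms {
    // Tiny illustrative table; a real one would cover every named entity.
    private static final Map<Integer, String> NAMES =
            Map.of(0x2019, "rsquo", 0x00AB, "laquo");

    // Hexadecimal numeric reference, e.g. &#x2019;
    static String hex(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    // Decimal numeric reference, e.g. &#8217;
    static String decimal(int codePoint) {
        return "&#" + codePoint + ";";
    }

    // Named reference if the table knows one, otherwise a decimal reference.
    static String named(int codePoint) {
        String name = NAMES.get(codePoint);
        return name != null ? "&" + name + ";" : decimal(codePoint);
    }

    // The character itself, left for the output charset to encode.
    static String literal(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int rsquo = 0x2019;
        System.out.println(hex(rsquo));
        System.out.println(decimal(rsquo));
        System.out.println(named(rsquo));
        System.out.println(literal(rsquo));
    }
}
```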


Source: https://habr.com/ru/post/1238695/

