Jsoup whitelist: non-English character parsing

I am trying to clean HTML and extract its text using Jsoup. The HTML may contain non-English characters.

For example, HTML text:

String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>"; 

Now, if I use Jsoup#parse(String html):

 String text = Jsoup.parse(html).text(); 

This prints:

 Á example link. 

And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):

 String text = Jsoup.clean(html, Whitelist.none()); 

This prints:

 &Aacute; example link. 

My question is: how can I get the text

 Á example link. 

using Whitelist and clean()? I want to use Whitelist because I may later need Whitelist#addTags(String... tags).

Any information would be very helpful to me.

Thanks.

1 answer

This is not possible in the current version (1.6.1). jsoup prints Á as &Aacute; because there is no way to turn off entity escaping: there is currently no "no escape" mode (see Entities.EscapeMode).

You can either 1. unescape these HTML entities afterwards, or 2. extend the jsoup source code by adding a new escape mode with an empty map.
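
The first workaround, turning the entities back into characters after clean() has run, could look roughly like the sketch below. It uses only the JDK so that it runs without jsoup on the classpath. The class name EntityUnescaper and its small entity table are hypothetical and for illustration only; they are not part of jsoup. The input string stands in for whatever Jsoup.clean(html, Whitelist.none()) returned.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityUnescaper {
    // Hypothetical, deliberately tiny entity table; the full HTML set is much larger.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("Aacute", "\u00C1");
        NAMED.put("aacute", "\u00E1");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches named (&name;), decimal (&#193;) and hexadecimal (&#xC1;) entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    public static String unescape(String escaped) {
        Matcher m = ENTITY.matcher(escaped);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = decode(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = decode(body.substring(1), 10, m.group());
            } else {
                // Unknown names are left exactly as they were.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a numeric reference to its character, or returns the original text if invalid.
    private static String decode(String digits, int radix, String fallback) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by Jsoup.clean(html, Whitelist.none()).
        String cleaned = "&Aacute; example link.";
        System.out.println(unescape(cleaned)); // prints "Á example link."
    }
}
```

In practice a library routine that knows the full entity set, such as StringEscapeUtils.unescapeHtml4 from Apache Commons Lang 3, is probably a better choice than a hand-written table. Either way the result is plain text, not sanitized HTML: an escaped &lt; becomes a literal < again, so it should not be written back into a page without escaping.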


Source: https://habr.com/ru/post/1399525/

