I am trying to clean HTML and extract the text from it using Jsoup. The HTML may contain non-English characters.
For example, given this HTML string:
String html = "<p>Á <a href='http://example.com/'><b>example</b></a> link.</p>";
Now, if I use Jsoup#parse(String html):
String text = Jsoup.parse(html).text();
This prints:
Á example link.
And if I clean the text using Jsoup#clean(String bodyHtml, Whitelist whitelist):
String text = Jsoup.clean(html, Whitelist.none());
This prints:
&Aacute; example link.
My question is: how can I get the text
Á example link.
using clean() with a Whitelist? I want to use a Whitelist because I may later need Whitelist#addTags(String... tags).
Any information would be very helpful to me.
Thanks.