Is it possible to remove the text nodes of unwanted tags?

I tested Jsoup and could not remove the text nodes of unwanted tags. I don't know whether I am doing something wrong. My code:

String pretty = Jsoup.clean("<img src=\"marco\">Capretta</img><i>Sono misterioso</i><p color=\"white\"><font size=\"5\">Ciao</p><p>some text</p><br/> <p>another text</p></font>" , "", Whitelist.basic().addTags("br", "p","i"), new Document.OutputSettings().prettyPrint(true));
System.out.println(pretty);

Result:

Capretta
<i>Sono misterioso</i>
<p>Ciao</p>
<p>some text</p>
<br> 
<p>another text</p>

But I don't need the text nodes of <img> (the same applies to other unwanted tags) ...

So this result would be better:

<i>Sono misterioso</i>
<p>Ciao</p>
<p>some text</p>
<br> 
<p>another text</p>

I may have other HTML as input ...

P.S. The question is about Java, not JavaScript!

2 answers

Assuming your HTML is fairly simple, you can achieve this by parsing the cleaned HTML first and then taking the children of the body tag:

String pretty = Jsoup.clean("<img src=\"marco\">Capretta</img><i>Sono misterioso</i><p color=\"white\"><font size=\"5\">Ciao</p><p>some text</p><br/> <p>another text</p></font>" , "", Whitelist.basic().addTags("br", "p","i"), new Document.OutputSettings().prettyPrint(true));
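// Re-parse the cleaned HTML and keep only the element children of <body>; loose text directly under <body> is dropped
pretty = Jsoup.parse(pretty).getElementsByTag("body").get(0).children().toString();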
System.out.println(pretty);

OUTPUT:

<i>Sono misterioso</i>
<p>Ciao</p>
<p>some text</p>
<br />
<p>another text</p>
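
The same idea can also be written by removing the loose text nodes explicitly instead of re-serializing the children. This is only a minimal sketch: the class name TopLevelText and the method dropTopLevelText are made up for illustration, and it assumes the input has already been passed through Jsoup.clean. It only touches text that sits directly under the body, so text inside kept tags such as <p> stays.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

public class TopLevelText {

    // Hypothetical helper: removes text nodes that are direct children of <body>.
    public static String dropTopLevelText(String cleanedHtml) {
        Document doc = Jsoup.parseBodyFragment(cleanedHtml);
        // textNodes() returns a copy of the list, so removing while iterating is safe
        for (TextNode text : doc.body().textNodes()) {
            text.remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String cleaned = "Capretta<i>Sono misterioso</i><p>Ciao</p>";
        System.out.println(dropTopLevelText(cleaned));
    }
}
```

Like the one-liner above, this cannot tell leftover text of a stripped tag apart from text that was legitimately written at the top level, which is why it only fits simple input.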

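The problem here is that <img> is a void element in HTML, i.e. it cannot have any content. A closing </img> tag is therefore not valid HTML.

As a result, JSoup's HTML parser ignores the </img>, and the text after <img> becomes a sibling text node instead of a child of the tag.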

You can use the XML parser of Jsoup instead:

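import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

String html = "<img src=\"marco\">Capretta</img><i>Sono misterioso</i>"
            + "<p color=\"white\"><font size=\"5\">Ciao</p>"
            + "<p>some text</p><br/> <p>another text</p></font>";
// The XML parser keeps "Capretta" as a child of <img>
Document xmldoc = Jsoup.parse(html, "", Parser.xmlParser());
// Removing the <img> elements therefore removes their text too
Elements imgs = xmldoc.select("img");
imgs.remove();

System.out.println(xmldoc);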
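
Since the question mentions other unwanted tags as well, the snippet above could be wrapped in a small helper that takes the tag names to drop. This is a rough sketch only: DropTags and dropTagsWithContent are hypothetical names, not part of Jsoup. It assumes you know up front which tags should disappear together with their text. Tags that are not listed are left alone, so the result can still be passed to Jsoup.clean as in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class DropTags {

    // Hypothetical helper: removes every listed tag together with everything nested inside it.
    public static String dropTagsWithContent(String html, String... tags) {
        // XML parsing keeps the text between <img> and </img> as a child of <img>
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        for (String tag : tags) {
            doc.select(tag).remove();
        }
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<img src=\"marco\">Capretta</img><i>Sono misterioso</i><p>some text</p>";
        System.out.println(dropTagsWithContent(html, "img"));
    }
}
```

Note that a listed tag loses all of its content, so <font> from the question's input should not be passed here if the "Ciao" nested inside it has to survive.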

Source: https://habr.com/ru/post/1626998/

