Java: replace all non-HTML tags in a String

I would like to replace every tag-like part of a String that is not a valid HTML tag. A tag-like part is anything enclosed in angle brackets <>, for instance < myemail@email.com > or <hello>. Real tags such as <br>, <div>, etc. must be kept.

Do you have any ideas how to achieve this?

Any help is appreciated!

+4
4 answers

You can use jsoup to clean the HTML.

 String cleaned = Jsoup.clean(html, Whitelist.relaxed()); 

You can use one of the predefined whitelists or build a custom one that lists the HTML elements and attributes allowed through the cleaner. Everything else is removed. (In jsoup 1.14.2 and later the class is named Safelist; Whitelist is the older name.)


Your specific example:

 String html = "one two three <blabla> four <text> five <div class=\"bold\">six</div>";
 String cleaned = Jsoup.clean(html, Whitelist.relaxed().addAttributes("div", "class"));
 System.out.println(cleaned);

Output:

 one two three four five <div class="bold"> six </div> 
+8

Look at the java.util.Scanner class: you can set a delimiter and then check whether each token matches an HTML tag. You will need to build an array of the tag names that should be left alone.

0

Remember to include end tags in your comparison as well. Look for the leading slash of an end tag (for example </div>) and strip it before comparing the tag name against your list.
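As a rough illustration of the comparison described in these answers, here is a minimal sketch that uses java.util.regex rather than Scanner. The class name TagFilter, the KEEP list and both patterns are assumptions made for this example and are not part of any library. It only looks at tag names, so it should not be treated as a sanitizer for untrusted input.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
    // Hypothetical keep-list: extend it with the HTML tags you want to preserve.
    private static final Set<String> KEEP =
            new HashSet<>(Arrays.asList("br", "div", "p", "span", "a"));

    // Anything between '<' and '>' that contains no further angle brackets.
    private static final Pattern TAG_LIKE = Pattern.compile("<([^<>]*)>");

    // Optional leading slash (end tag), a tag name, optional attributes,
    // optional trailing slash (self-closing form such as "br/").
    private static final Pattern TAG_BODY =
            Pattern.compile("/?([A-Za-z][A-Za-z0-9]*)(?:\\s[^<>]*)?/?");

    // Returns the lower-case tag name, or "" if the text does not look like a tag.
    public static String tagName(String inner) {
        Matcher m = TAG_BODY.matcher(inner);
        return m.matches() ? m.group(1).toLowerCase(Locale.ROOT) : "";
    }

    // Removes every <...> part whose name is not in the keep-list.
    public static String stripUnknownTags(String input) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean keep = KEEP.contains(tagName(m.group(1)));
            // Known tags are copied verbatim; everything else is replaced by "".
            m.appendReplacement(out, keep ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "one <blabla> two < myemail@email.com > <div class=\"bold\">six</div><br>";
        System.out.println(stripUnknownTags(html));
    }
}
```

The slash is handled inside TAG_BODY, so </div> and <div> are compared against the same entry in the list. To replace unknown parts with some other text instead of removing them, pass that text (wrapped in Matcher.quoteReplacement) in place of the empty string.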

0

If you are doing this to display untrusted data on a web page, simply removing the invalid tags is not enough. Take a look at OWASP AntiSamy.

0

Source: https://habr.com/ru/post/1335550/

