How to remove all HTML attributes in HTML tags in a string

Question

I am trying to take a string of HTML, strip some tags (img, object) entirely, and strip the attributes from all other HTML tags. For example:

<div id="someId" style="color: #000000"> <p class="someClass">Some Text</p> <img src="images/someimage.jpg" alt="" /> <a href="somelink.html">Some Link Text</a> </div>

It should become:

 <div> <p>Some Text</p> Some Link Text </div>

This is what I have tried so far:

 string.replaceAll("<\/?[img|object](\s\w+(\=\".*\")?)*\>", ""); //REMOVE img/object

I am not sure how to remove all attributes inside the tag.

Any help would be appreciated.

Thanks.

java regex html-parsing

fanfavorite Feb 23 '12 at 15:23

4 answers

I would not recommend regex for this if you want to filter specific tags. It will be a hell of a job and will never be completely reliable. Use a proper HTML parser such as Jsoup. It offers a Whitelist API for cleaning up HTML. See also this cookbook document.

Here is an example run using Jsoup. It allows the <div> and <p> tags in addition to the standard tag set of the selected Whitelist, which is Whitelist#simpleText() in the following example.

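 import org.jsoup.Jsoup;
 import org.jsoup.safety.Whitelist; // newer jsoup releases rename this class to Safelist

 String html = "<div id='someId' style='color: #000000'><p class='someClass'>Some Text</p><img src='images/someimage.jpg' alt='' /><a href='somelink.html'>Some Link Text</a></div>";

 // Whitelist.simpleText() allows b, em, i, strong, u.
 // Use Whitelist.none() instead if you want to start clean.
 Whitelist whitelist = Whitelist.simpleText();
 whitelist.addTags("div", "p");

 // Tags outside the whitelist are dropped (their text is kept); no attributes were whitelisted, so all are stripped.
 String clean = Jsoup.clean(html, whitelist);
 System.out.println(clean);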

The result is

 <div> <p>Some Text</p>Some Link Text </div>
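
To illustrate why the regex route is fragile, here is a minimal sketch using only java.util.regex. It is not the pattern from the question verbatim: the attribute part is simplified to [^>]* and a \b is added after the tag name, and the class name TagRegexPitfall and the helper stripWith are made up for this example. It shows that [img|object] is a character class that matches a single character rather than either tag name, and that even the corrected alternation breaks once an attribute value contains a > character.

```java
import java.util.regex.Pattern;

public class TagRegexPitfall {
    // Square brackets form a character class: this matches exactly ONE
    // character out of i, m, g, |, o, b, j, e, c, t as the tag name.
    static final Pattern CHAR_CLASS = Pattern.compile("</?[img|object]\\b[^>]*>");

    // Alternation needs a group: (?:img|object) matches either whole name.
    static final Pattern ALTERNATION = Pattern.compile("</?(?:img|object)\\b[^>]*>");

    // Hypothetical helper: remove every match of the pattern from the input.
    static String stripWith(Pattern pattern, String html) {
        return pattern.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<b>bold</b><img src='a.jpg' alt='' /><i>it</i>";

        // The character class removes <b> and <i> but leaves <img> alone.
        System.out.println(stripWith(CHAR_CLASS, html));
        // prints: bold<img src='a.jpg' alt='' />it

        // The alternation removes <img> as intended.
        System.out.println(stripWith(ALTERNATION, html));
        // prints: <b>bold</b><i>it</i>

        // A '>' inside an attribute value ends the match too early.
        String tricky = "<img alt='a>b' src='x.jpg'>text";
        System.out.println(stripWith(ALTERNATION, tricky));
        // prints: b' src='x.jpg'>text
    }
}
```

The last case is the kind of input a parser handles correctly and a pattern does not, which is why the Jsoup approach above is the safer choice.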
