How to remove all attributes from HTML tags in a string

I'm trying to take a string containing HTML, remove some tags (img, object) entirely, and strip the attributes from all other HTML tags. For instance:

<div id="someId" style="color: #000000"> <p class="someClass">Some Text</p> <img src="images/someimage.jpg" alt="" /> <a href="somelink.html">Some Link Text</a> </div> 

should become:

 <div> <p>Some Text</p> Some Link Text </div> 

So far I have tried:

 string.replaceAll("<\/?[img|object](\s\w+(\=\".*\")?)*\>", ""); //REMOVE img/object 

I am not sure how to remove all attributes inside the tag.

Any help would be appreciated.

Thanks.

4 answers

You can remove all attributes like this:

 string = string.replaceAll("(<\\w+)[^>]*(>)", "$1$2"); // strings are immutable, so keep the result

This expression matches an opening tag but captures only its name (for example <div) and the closing > as groups 1 and 2. replaceAll then uses references to these groups to join them back together as $1$2, which drops the attributes in the middle of the tag. Closing tags such as </div> are left untouched because \w+ does not match the slash.
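As a minimal sketch of how the two steps from the question could be combined, the class below first drops img/object tags and then strips the attributes from whatever opening tags remain. The class name TagStripper and the method name clean are hypothetical. It assumes no attribute value contains a literal > character. Note that the attempt in the question uses [img|object], which is a character class matching a single character; a group such as (?:img|object) is needed instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TagStripper {
    // Opening, closing, or self-closing <img> and <object> tags.
    // (?:img|object) is a group; [img|object] would match one single character.
    private static final Pattern DROPPED_TAGS =
            Pattern.compile("</?(?:img|object)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Any opening tag: group 1 is "<name", group 2 is the closing ">".
    private static final Pattern OPENING_TAG = Pattern.compile("(<\\w+)[^>]*(>)");

    static String clean(String html) {
        // Step 1: remove the unwanted tags themselves (their text content stays).
        String withoutDroppedTags = DROPPED_TAGS.matcher(html).replaceAll("");
        // Step 2: keep only "<name" and ">" of every remaining opening tag.
        return OPENING_TAG.matcher(withoutDroppedTags).replaceAll("$1$2");
    }
}
```

Calling TagStripper.clean on the sample markup leaves the <a> element in place (without its href), so getting the exact output from the question would still require adding a to the list of dropped tags.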


I would not recommend regex for this if you want to filter specific tags. It will be a hell of a job and will never be completely reliable. Use a real HTML parser such as Jsoup. It offers a Whitelist API for HTML cleanup. See also the Jsoup cookbook.

Here is an example using Jsoup that allows only the <div> and <p> tags in addition to the default tag set of the chosen Whitelist, which is Whitelist#simpleText() in this example.

 import org.jsoup.Jsoup;
 import org.jsoup.safety.Whitelist; // newer jsoup releases rename this class to Safelist

 String html = "<div id='someId' style='color: #000000'><p class='someClass'>Some Text</p><img src='images/someimage.jpg' alt='' /><a href='somelink.html'>Some Link Text</a></div>";
 // Whitelist.simpleText() allows b, em, i, strong, u.
 // Use Whitelist.none() instead if you want to start clean.
 Whitelist whitelist = Whitelist.simpleText();
 whitelist.addTags("div", "p");
 String clean = Jsoup.clean(html, whitelist);
 System.out.println(clean);

The result is

 <div> <p>Some Text</p>Some Link Text </div> 



/<(/?\w+) .*?>/<\1>/ can work: it captures the tag name (the group), skips over any attributes up to the closing bracket, and replaces the whole tag with just the brackets and the tag name.
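For reference, a rough Java port of that sed-style substitution might look like the sketch below. The class and method names are hypothetical. The main differences are that backslashes must be doubled inside a Java string literal and that the replacement refers to the group as $1 rather than \1. Because . does not match line breaks by default, a tag that spans several lines would be left unchanged.

```java
// Hypothetical helper, not part of any library.
final class SedStyleStripper {
    static String stripAttributes(String html) {
        // <(/?\w+) captures the tag name, including an optional leading slash.
        // " .*?>" lazily consumes everything from the first space to the next ">".
        // Tags without a space after the name (e.g. <p> or </div>) do not match.
        return html.replaceAll("<(/?\\w+) .*?>", "<$1>");
    }
}
```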


It would probably be a lot easier to use SAX or DOM: visit each node, keep its name and value, and remove all of its attributes.
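A minimal sketch of the DOM variant, using only the XML APIs bundled with the JDK, could look like this. The class and method names are hypothetical. It assumes the input is well-formed XML (XHTML) with a single root element; typical real-world HTML with unclosed tags will be rejected by the parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class DomAttributeStripper {

    // Recursively removes every attribute from the node and its descendants.
    static void stripAttributes(Node node) {
        NamedNodeMap attributes = node.getAttributes(); // null for non-element nodes
        if (attributes != null) {
            // Walk backwards so the remaining indices stay valid while removing.
            for (int i = attributes.getLength() - 1; i >= 0; i--) {
                attributes.removeNamedItem(attributes.item(i).getNodeName());
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            stripAttributes(child);
        }
    }

    // Parses well-formed markup, strips all attributes, and serializes it back.
    static String clean(String xhtml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            stripAttributes(document.getDocumentElement());

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Unwanted elements such as img could be dropped in the same traversal by checking getNodeName() and removing the node from its parent.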


Source: https://habr.com/ru/post/1398061/

