Regular expression to remove HTML tags from a string

Possible duplicate:
Regular expression to remove HTML tags

Is there an expression that gets a value between two HTML tags?

Considering this:

<td class="played">0</td> 

I am looking for an expression that will return 0 by stripping the surrounding <td> tags.

+44
html regex
Jun 27 '12 at 15:30
3 answers

The following examples are in Java, but the regex will be similar, if not identical, in other languages.

 String target = someString.replaceAll("<[^>]*>", ""); 

This assumes that your non-HTML text does not contain any < or > characters and that the input is well-formed.
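
As a minimal sketch of the above, here is the same replacement wrapped in a small helper and applied to the markup from the question. The class and method names (TagStripper, stripTags) are made up for illustration and do not come from any library.

```java
public class TagStripper {
    // Removes anything that looks like a tag:
    // '<', then any run of characters other than '>', then '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<td class=\"played\">0</td>")); // prints 0
    }
}
```

The same caveat applies: this only behaves sensibly when the text between tags contains no stray < or > characters.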

If you know that only a specific tag is involved (for example, the text contains only <td> tags), you can restrict the pattern to that tag, matching both its opening and closing forms:

 String target = someString.replaceAll("(?i)</?td\\b[^>]*>", ""); 

Edit: Ωmega raised a good point in a comment on another post: if the input contains several tags, the results all end up flattened together.

For example, if the input string is <td>Something</td><td>Another Thing</td>, the replacement above produces SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

 String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim(); 

This replaces each tag with a single space, then collapses runs of whitespace into one space, and finally trims both ends.
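
If the values need to stay separate rather than being joined into one string, a capturing group can pull out each cell's content individually. The following is only a sketch under assumptions: the names CellExtractor and cellValues are hypothetical, and the pattern assumes the <td> elements are not nested and each one has a closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CellExtractor {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // (.*?) lazily captures everything up to the nearest closing </td>.
    private static final Pattern CELL =
            Pattern.compile("(?is)<td\\b[^>]*>(.*?)</td>");

    // Returns the trimmed text of every <td>...</td> pair, in document order.
    static List<String> cellValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(cellValues("<td>Something</td><td>Another Thing</td>"));
        // prints [Something, Another Thing]
    }
}
```

For the markup in the question, cellValues would return a one-element list containing "0".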

+91
Jun 27 '12

A trivial approach would be to replace

 <[^>]*> 

with nothing. But depending on how poorly structured your input is, this can fail badly.
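
To make the failure modes concrete, here is a small sketch with two inputs on which this pattern goes wrong. The inputs and the class name NaiveStripFailures are invented for illustration.

```java
public class NaiveStripFailures {
    static String strip(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        System.out.println("[" + strip("<a title=\"a > b\">link</a>") + "]");
        // prints [ b">link]

        // Plain text containing '<' and '>' is mistaken for a tag and removed.
        System.out.println("[" + strip("1 < 2 and 3 > 2") + "]");
        // prints [1  2]
    }
}
```

Inputs like these are where an HTML parser is the safer choice.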

+33
Jun 27 '12 at 15:31

You can do this with jsoup (http://jsoup.org/). Note that newer jsoup releases rename Whitelist to Safelist.

 Whitelist whitelist = Whitelist.none();
 String cleanStr = Jsoup.clean(yourText, whitelist);
+3
Jun 27 '12 at 15:34