Regular expression to remove HTML tags from a string

Question

Possible duplicate:
Regular expression to remove HTML tags

Is there an expression that gets a value between two HTML tags?

Considering this:

<td class="played">0</td>

I am looking for an expression that will return 0 by stripping the surrounding <td> tags.

html regex

danny Jun 27 '12 at 15:30

3 answers

Roddy of the Frozen Peas · Answer 1 · 2012-06-27 15:42

The following examples are in Java, but the regex will be similar, if not identical, in other languages.

 String target = someString.replaceAll("<[^>]*>", "");

This assumes that your non-HTML text does not contain any < or > characters and that your input is properly structured.
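As a rough sketch of that one-liner applied to the cell from the question (the class and method names here are made up for illustration, and the same assumption about stray < and > applies):

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Same pattern as above, precompiled: "<", any run of characters other than ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every match of TAG. Assumes no stray '<' or '>' in text or attribute values.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<td class=\"played\">0</td>")); // prints 0
    }
}
```

Precompiling the Pattern only matters if you call this many times; for a one-off, String.replaceAll does the same thing.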

If you know you are dealing with one specific tag (for example, the text contains only <td> tags), you can do something like this:

 String target = someString.replaceAll("(?i)</?td\\b[^>]*>", "");

Edit: Ωmega raised a good point in a comment on another post: if there are several tags, this flattens all the results together.

For example, if the input string were <td>Something</td><td>Another Thing</td>, the above would produce SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

 String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces each tag with a single space, collapses runs of whitespace into one space, and then trims both ends.
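If you would rather keep each cell's value separate than join them with spaces, a capturing group is one option. This is only a sketch: the class and method names are hypothetical, and it assumes the cells are not nested and every <td> has a matching </td>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CellExtractor {
    // Group 1 captures the text between an opening <td ...> and the next </td>.
    // (?i) ignores case, (?s) lets '.' match line breaks, and '.*?' is reluctant,
    // so each cell is matched on its own instead of one match spanning all cells.
    private static final Pattern CELL = Pattern.compile("(?is)<td\\b[^>]*>(.*?)</td>");

    static List<String> cellTexts(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(m.group(1).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(cellTexts("<td>Something</td><td>Another Thing</td>"));
        // prints [Something, Another Thing]
    }
}
```

For the single cell in the question, cellTexts returns a one-element list containing 0.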

Joey · Answer 2 · 2012-06-27 15:31

A trivial approach would be to replace

 <[^>]*>

with nothing. But depending on how poorly structured your input is, this can fail badly.
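To make the failure mode concrete, here is a small sketch (the class name and the sample input are invented for illustration) where a > inside a quoted attribute value ends the match too early:

```java
final class NaiveStripPitfall {
    // The trivial approach: delete everything from '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The '>' inside the title attribute closes the first match prematurely,
        // so the rest of the attribute is left behind as if it were text.
        String html = "<a title=\"a > b\">link</a>";
        System.out.println("[" + stripTags(html) + "]"); // prints [ b">link]
    }
}
```

An unescaped < or > in ordinary text causes the same kind of damage in the other direction: real text gets swallowed as if it were a tag.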

mihaisimi · Answer 3 · 2012-06-27 15:34

You can do this with jsoup (http://jsoup.org/):

 // Newer jsoup releases name this class Safelist instead of Whitelist.
 Whitelist whitelist = Whitelist.none();
 String cleanStr = Jsoup.clean(yourText, whitelist);
