The easiest way to strip XML tags from text is a regular expression that matches every tag, that is, everything from a "<" to the next ">" including what lies between them. This works whether or not the XML is well-formed, because it removes each tag on its own and never checks that opening tags match closing tags.
For instance,
String noXmlString = xmlString.replaceAll("\\<.*?\\>", "");
will remove all tags from the given string. The drawback is that it also throws away the URLs of the images and hyperlinks in your example, because they are stored in attributes inside the tags. Hope this helps!
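If the URLs matter, one possible workaround for the drawback above is to copy them out of the tags before stripping. The following is only a sketch: the class name, the method name and the `[url]` output format are made up for illustration, and it assumes double-quoted `href` / `src` attributes on `a` and `img` tags only.

```java
import java.util.regex.Pattern;

class LinkKeepingStripper {
    // Hypothetical pattern: an <a> or <img> tag carrying a double-quoted
    // href or src attribute. Group 1 captures the URL.
    private static final Pattern LINK_TAG = Pattern.compile(
            "<(?:a|img)\\b[^>]*?\\b(?:href|src)\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Replaces each link/image tag with "[url] ", then removes all other tags.
    static String stripTagsKeepLinks(String input) {
        String withUrls = LINK_TAG.matcher(input).replaceAll("[$1] ");
        return withUrls.replaceAll("<.*?>", "").trim();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com\">site</a> <img src=\"pic.png\" alt=\"x\">";
        System.out.println(stripTagsKeepLinks(html));
    }
}
```

Single-quoted or unquoted attribute values are not handled by this pattern; those tags are simply stripped along with everything else.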
Edited 11:58 04/04/10: try this to remove HTML tags from HTML text (i.e. everything that starts with < and ends with >):
String noHtmlHtmlString = htmlHtmlString.replaceAll("<.+?>", "");
Then, to remove any remaining HTML entities, such as &quot; (that is, everything that begins with & and ends with ; and has only word characters, with no spaces or gaps, in between), use
String noHtmlEncodingString = htmlEncodingString.replaceAll("&\\w+?;", "");
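Putting the two replacements together, a minimal sketch could look like the one below. The class and method names are invented for this example; the regular expressions are the ones from the snippets above.

```java
class HtmlTextCleaner {
    // Removes everything from a '<' to the next '>' (non-greedy).
    static String stripTags(String input) {
        return input.replaceAll("<.+?>", "");
    }

    // Removes named entities such as &quot; or &amp;.
    static String stripEntities(String input) {
        return input.replaceAll("&\\w+?;", "");
    }

    // Applies both steps: tags first, then entities.
    static String clean(String input) {
        return stripEntities(stripTags(input));
    }

    public static void main(String[] args) {
        // Mismatched tags are still removed, since no nesting is checked.
        System.out.println(stripTags("<b>bold<i>text</b>"));
        // The entity is deleted, not decoded.
        System.out.println(clean("<p>Tom &amp; Jerry</p>"));
    }
}
```

Two limits are worth knowing. `\w` does not match `#`, so numeric entities such as `&#39;` are left in place. And `.` does not match line breaks by default, so a tag that spans several lines is not removed unless the pattern is compiled with `Pattern.DOTALL`.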
If the HTML / XML is malformed in ways that go beyond this and follows no known pattern, there is no way to catch those cases with a regular expression.