Tag strip from text extracted from XML

Question

I am parsing XML documents and calling getTextContent() to get the text from the specific section that I want. The text that I get back contains tags like

 <italic> </italic> <sub> </sub>

... and a few more. I want to strip these tags and keep just the text, no matter what the tags are.

My document is as follows

 <article>
   <sec>Section 1</sec>
   <sec>Section 2
     <title>Title1</title>
     <sec>
       <title>Subtitle1</title>
       <p>........<italic> </italic>...</p>
     </sec>
     <sec>
       <title>Subtitle2</title>
       <p>........<sub> </sub>...</p>
     </sec>
   </sec>
 </article>

I need all the text in <p>...</p> without any tags in it. How can I do this? I was thinking about identifying all the tags and replacing them with "", but there must be a better way.

thanks

+4

java xml-parsing

y2p Mar 21 '11 at 18:49

2 answers

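You can use a Perl script to read the file, and then use s/ < .*? > //xg; to get rid of all tags. Note the non-greedy .*? here: a greedy .* would delete everything from the first < to the last > on a line, including the text between tags.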

0

B. Bowles Mar 21 '11 at 18:58

Kevin d · Accepted Answer · 2011-03-21T19:23:12+0000

You can apply this regex to the result of getTextContent():

 String noHTMLString = htmlString.replaceAll("\\<.*?\\>", "");
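For completeness, here is a minimal self-contained sketch of the same idea. The class name TagStripper and the sample string are made up for illustration. It uses the pattern <[^>]*> instead of <.*?>, a small variation that also removes a tag split across lines, since . does not match line breaks by default. Like any regex approach, it treats every <...> run as a tag, so a literal < followed later by a > in the text would be removed too.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class TagStripper {
    // '<', then any run of characters that are not '>', then '>'.
    // The negated class also matches newlines, unlike '.' by default.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the input with everything that looks like a tag removed.
    static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input, assumed to resemble what getTextContent() returned.
        String p = "H<sub>2</sub>O is <italic>water</italic>";
        System.out.println(stripTags(p)); // prints "H2O is water"
    }
}
```

Precompiling the pattern once avoids recompiling it on every call, which replaceAll on a String would do.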
