Stripping tags from text extracted from XML

I am parsing XML documents and calling getTextContent() to get the text of the specific section I want. The text I get back contains tags like

 <italic> </italic> <sub> </sub> 

... and a few more. I want to strip these tags and keep only the text, whatever the tags are.

My document looks like this:

 <article>
   <sec>Section 1</sec>
   <sec>Section 2
     <title>Title1</title>
     <sec>
       <title>Subtitle1</title>
       <p>........<italic> </italic>...</p>
     </sec>
     <sec>
       <title>Subtitle2</title>
       <p>........<sub> </sub>...</p>
     </sec>
   </sec>
 </article>

I need all the text inside <p>...</p> with the tags removed. How can I do this? I was thinking of identifying every tag and replacing it with "", but there must be a better way.

thanks

2 answers

You can apply this regex to the result of getTextContent():

 String noHTMLString = htmlString.replaceAll("\\<.*?\\>", ""); 
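As a minimal sketch of how that replacement might be wrapped for reuse, assuming the extracted text really does contain literal tags and that no tag has a '>' inside an attribute value: the class name TagStripper, the method stripTags, and the sample string are hypothetical, chosen only for illustration. The pattern here uses [^>]* instead of .*? so that a tag broken across lines is still matched; that is a swapped-in variant, not the pattern from the answer above.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // One tag: '<', then any run of characters other than '>', then '>'.
    // Compiled once so repeated calls do not recompile the pattern.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the input with every tag-like substring removed.
    // Hypothetical helper; it does not handle comments or CDATA sections.
    public static String stripTags(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample text standing in for the result of getTextContent().
        String p = "H<sub>2</sub>O is <italic>water</italic>";
        System.out.println(stripTags(p)); // prints: H2O is water
    }
}
```

Only the markup is removed; the text between an opening and a closing tag is kept, which is what the question asks for.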

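You can use a Perl script to read the file and then apply s/ < [^>]* > //xg; to get rid of all tags. (A greedy .* between the angle brackets would also delete the text between the first and last tag on a line.)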
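One detail worth checking in any tag-stripping pattern is greediness: a greedy .* between the angle brackets runs to the last '>' on the line, so the text between two tags is lost along with the tags. The same behaviour applies to Java's regex engine, so here is a small Java sketch of the difference. The class GreedyDemo, its two methods, and the sample line are hypothetical names used only for this illustration.

```java
public class GreedyDemo {
    // Greedy: '.*' extends to the LAST '>' on the line.
    public static String stripGreedy(String s) {
        return s.replaceAll("<.*>", "");
    }

    // Bounded: '[^>]*' stops at the FIRST '>' after each '<'.
    public static String stripBounded(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String line = "a<sub>2</sub>b";
        System.out.println(stripGreedy(line));  // prints: ab
        System.out.println(stripBounded(line)); // prints: a2b
    }
}
```

The greedy version silently drops the "2" that sat between the tags, which is why the bounded or non-greedy form is the safer choice here.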


Source: https://habr.com/ru/post/1344656/

