Cleaning a string containing HTML / server-side tags in Java

I have text like:

I've got a date with this fellow tomorrow. Well, me and thousands of others. <img src="http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg"> <br /> <br /> <br> Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barack Obama speak. You all should come too! <br> <a href="http://nh.barackobama.com/manchesterchange">RSVP for the event</a>

I would like to clean it up so that it becomes:

I've got a date with this fellow tomorrow. Well, me and thousands of others. http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barack Obama speak. You all should come too! h**p://nh.barackobama.com/manchesterchange RSVP for the event

I would like to write a Java program for this. Any pointers or suggestions would be appreciated. The tags are not limited to those in the message above; it is just an example.

Thanks!

P.S. Replace * with t in the second hyperlink; Stack Overflow does not allow me to post more than one link.

+4
3 answers

The easiest way to strip tags from text is to use a regular expression that matches anything that looks like a tag, that is, everything from a "<" up to the next ">". Note that this works regardless of whether the markup is well-formed, because it removes every tag whether or not the opening tags match the closing tags.

For instance,

String noXmlString = xmlString.replaceAll("<[^>]*>", "");

will remove all tags from the given string. The drawback is that it also discards the image URL and the hyperlink target, which your example wants to keep. Hope this helps!
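If the image and link URLs need to survive, one option is to replace each `<img>` and `<a>` tag with its URL before stripping everything else. The following is only a minimal sketch: the class name `TagStripper` and both patterns are illustrative assumptions rather than part of the answer above, and regular expressions remain fragile against unusual markup.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches <img ... src="URL" ...> or <a ... href="URL" ...>; group 1 is the URL.
    // Whitespace around "=" and inside the quotes is tolerated.
    private static final Pattern URL_TAG = Pattern.compile(
        "<(?:img|a)\\b[^>]*?\\b(?:src|href)\\s*=\\s*[\"']\\s*([^\"'>]*?)\\s*[\"'][^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Matches any remaining tag, including tags that span several lines.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String clean(String html) {
        // Step 1: keep the URL of image and anchor tags in place of the tag.
        String withUrls = URL_TAG.matcher(html).replaceAll(" $1 ");
        // Step 2: drop every other tag.
        String noTags = ANY_TAG.matcher(withUrls).replaceAll(" ");
        // Step 3: collapse the whitespace left behind.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String sample = "You all should come too! <br> "
            + "<a href=\"http://nh.barackobama.com/manchesterchange\">RSVP for the event</a>";
        System.out.println(clean(sample));
    }
}
```

The order of the steps matters here: the URLs have to be pulled out before the generic tag pattern runs, otherwise they are deleted together with their tags.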

Edited 11:58 04/04/10: try this to remove HTML-encoded tags (i.e. everything that starts with &lt; and ends with &gt;):

 String noHtmlHtmlString = htmlHtmlString.replaceAll("&lt;.+?&gt;", ""); 

Then, to remove any other HTML-encoded bits such as &quot; (that is, everything that begins with & and ends with ; and has only word characters, with no spaces, in between), use

 String noHtmlEncodingString = htmlEncodingString.replaceAll("&\\w+?;", ""); 
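The two replacements above depend on their order, which a short sketch can make concrete. The class name `EncodedTagStripper` and the sample input are assumptions for illustration only. If the entity pattern ran first, it would delete the &lt; and &gt; delimiters and leave the tag bodies (for example "br /") behind as text. Note also that `\w` does not match `#`, so numeric entities such as `&#39;` are left untouched by this pattern.

```java
class EncodedTagStripper {
    static String clean(String encoded) {
        // Remove entity-encoded tags first, while &lt; and &gt; still delimit them.
        String noTags = encoded.replaceAll("&lt;.+?&gt;", "");
        // Then remove the remaining named entities such as &quot; or &amp;.
        String noEntities = noTags.replaceAll("&\\w+?;", "");
        // Collapse the whitespace left behind by the removals.
        return noEntities.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String sample = "Hello &lt;br /&gt; &quot;world&quot; &lt;b&gt;bold&lt;/b&gt;";
        System.out.println(clean(sample));
    }
}
```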

If the input contains malformed HTML / XML beyond this and it follows no known pattern, there is no way to catch it with this approach.

0

JTidy will do what you want. I just tried it by saving the block of text from your message as test.txt and running JTidy with these parameters:

 java -jar jtidy-r938.jar -asxml test.txt >test.html 

It produced the following well-formed XHTML:

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="generator" content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
 <title></title>
 </head>
 <body>
 I've got a date with this fellow tomorrow. Well me and thousands of others. <br />
 <br />
 <img src="http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg" /><br />
 <br />
 Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barak Obama speak. <br />
 <br />
 You all should come too!<br />
 <br />
 <a href="http://nh.barackobama.com/manchesterchange">RSVP for the event</a>
 </body>
 </html>

If you use the API instead of the command line, you can extract the bits you are interested in and discard the rest.
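Once the markup is well-formed XHTML, pulling out the text and URLs is a plain tree walk. The sketch below deliberately uses the JDK's built-in DOM parser rather than JTidy's own API, and assumes the tidied XHTML is already available as a string. The class name `XhtmlTextExtractor` is hypothetical, and the empty entity resolver is an assumption made so that the DOCTYPE's external DTD is not fetched over the network.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XhtmlTextExtractor {

    // Returns the body text with image and link URLs kept inline.
    static String extract(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve any external DTD to an empty document instead of downloading it.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            Node body = doc.getElementsByTagName("body").item(0);
            StringBuilder out = new StringBuilder();
            collect(body != null ? body : doc.getDocumentElement(), out);
            return out.toString().replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk: text nodes are copied, img/a elements contribute their URL.
    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if (el.getNodeName().equals("img")) {
                out.append(' ').append(el.getAttribute("src")).append(' ');
            } else if (el.getNodeName().equals("a")) {
                out.append(' ').append(el.getAttribute("href")).append(' ');
            }
            for (Node child = el.getFirstChild(); child != null; child = child.getNextSibling()) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body>You all should come too!<br />"
            + "<a href=\"http://nh.barackobama.com/manchesterchange\">RSVP for the event</a>"
            + "</body></html>";
        System.out.println(extract(xhtml));
    }
}
```

Walking a parsed tree avoids the regular-expression pitfalls with malformed tags, at the cost of requiring a tidying step such as JTidy first.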

+1

I would look at an HTML parser such as JTidy. Despite its name, it parses HTML and provides a useful API that lets you extract what you need.

0

Source: https://habr.com/ru/post/1305928/

