Apache Tika: remove extra blank lines from the extracted text

I have an HTML file:

<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"> <div>Test message.</div> <div>&nbsp;</div> <div>More content here...</div> <div>&nbsp;</div> <div>Best regards,</div> <div>Mr. Crowley</div></div></body></html> 

I am trying to get the contents of the file above using Apache Tika ...

    final InputStream input = new FileInputStream("file.html");
    final ContentHandler handler = new BodyContentHandler();
    final Metadata metadata = new Metadata();
    final HtmlParser htmlParser = new HtmlParser();
    htmlParser.parse(input, handler, metadata, new ParseContext());
    String plainText = handler.toString();
    System.out.println(plainText);

... and everything is fine, except for the extra lines:

 Test message. More content here... Best regards, Mr. Crowley <and 3 empty lines here> 

Can this behavior be avoided? Is it possible to get the expected result instead:

 Test message. More content here... Best regards, Mr. Crowley 

?

Post-processing the string with constructs such as

    plainText = plainText.replaceAll("(\n)+", "\n");

is unfortunately not an option for me here. Also, I cannot change the structure of my HTML file.

1 answer

One solution is to implement a custom ContentHandler that does not write these generated newlines (newlines that come from the original document are preserved):

    public class OriginalBodyContentHandler extends BodyContentHandler {

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
            // Not writing extra new lines generated by XHTMLContentHandler.
        }
    }
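
To illustrate why overriding ignorableWhitespace has this effect, here is a minimal sketch that uses only the SAX classes shipped with the JDK and does not call Tika at all. The class names (IgnorableWhitespaceDemo, CollectingHandler, QuietHandler), the replay helper, and the event stream it sends are all made up for this example. The sketch assumes, as the comment in the answer suggests, that generated newlines arrive through ignorableWhitespace while text from the document arrives through characters.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class IgnorableWhitespaceDemo {

    /** Collects all text, including whitespace reported as ignorable. */
    static class CollectingHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Same collector, but whitespace reported as ignorable is dropped. */
    static class QuietHandler extends CollectingHandler {
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Intentionally empty: generated newlines are not written.
        }
    }

    /** Replays a hypothetical event stream and returns the collected text. */
    static String replay(ContentHandler handler) throws SAXException {
        String[] lines = {"Test message.", "Best regards,"};
        char[] generatedNewline = {'\n'};
        for (String line : lines) {
            char[] text = (line + "\n").toCharArray();
            // Text and newline that belong to the document itself.
            handler.characters(text, 0, text.length);
            // Extra newline assumed to be added by the XHTML layer.
            handler.ignorableWhitespace(generatedNewline, 0, 1);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws SAXException {
        // Newlines are printed as the two characters \n to make them visible.
        System.out.println(replay(new CollectingHandler()).replace("\n", "\\n"));
        System.out.println(replay(new QuietHandler()).replace("\n", "\\n"));
    }
}
```

In this sketch the first handler ends up with doubled newlines, while the second keeps only the newlines delivered through characters, which mirrors the behavior described for OriginalBodyContentHandler.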

Source: https://habr.com/ru/post/1489773/

