Apache Tika: remove extra blank lines from the resulting string

Question


I have an html file:

<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"> <div>Test message.</div> <div>&nbsp;</div> <div>More content here...</div> <div>&nbsp;</div> <div>Best regards,</div> <div>Mr. Crowley</div></div></body></html>

I am trying to get the contents of the file above using Apache Tika ...

    final InputStream input = new FileInputStream("file.html");
    final ContentHandler handler = new BodyContentHandler();
    final Metadata metadata = new Metadata();
    final HtmlParser htmlParser = new HtmlParser();
    htmlParser.parse(input, handler, metadata, new ParseContext());
    String plainText = handler.toString();
    System.out.println(plainText);

... and everything is fine, except for the extra lines:

 Test message. More content here... Best regards, Mr. Crowley <and 3 empty lines here>

Can this behavior be avoided? Is it possible to get the following, cleaner result instead?

    Test message.
    More content here...
    Best regards,
    Mr. Crowley

Post-processing the string with constructs such as

    plainText = plainText.replaceAll("(\n)+", "\n");

is unfortunately not an option for me here. I also cannot change the structure of my HTML file.


java apache-tika

hard-code Jul 04 '13 at 17:26


1 answer

Andrey · Accepted Answer · 2014-12-18T16:26:35+0000

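One solution is to implement a custom ContentHandler that does not write these newlines (newlines that come from the original document are preserved). Pass an instance of it to htmlParser.parse(...) in place of the plain BodyContentHandler: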

    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    public class OriginalBodyContentHandler extends BodyContentHandler {

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
            // Not writing extra new lines generated by XHTMLContentHandler.
        }
    }
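
To see why overriding ignorableWhitespace has this effect, here is a minimal stand-alone sketch that uses only the JDK's SAX classes and does not involve Tika at all. The class names (IgnorableWhitespaceDemo, CollectingHandler, SkippingHandler) and the replayed event stream are hypothetical and chosen for illustration; they are not the events Tika actually emits for the HTML above. The sketch assumes that formatting newlines arrive through ignorableWhitespace(), while text from the document arrives through characters().

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Stand-alone illustration using only the JDK's SAX classes; no Tika involved.
class IgnorableWhitespaceDemo {

    /** Writes both text events and ignorable-whitespace events to a buffer. */
    static class CollectingHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            out.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Same as above, but drops everything reported as ignorable whitespace. */
    static class SkippingHandler extends CollectingHandler {
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Intentionally empty: formatting newlines are not written.
        }
    }

    /** Replays a hypothetical event stream (an assumption, not Tika's real output). */
    static String replay(ContentHandler handler) {
        try {
            char[] text = "Line one\nLine two".toCharArray(); // newline inside document text
            char[] newline = {'\n'};                           // formatting newline
            char[] more = "Line three".toCharArray();
            handler.characters(text, 0, text.length);
            handler.ignorableWhitespace(newline, 0, 1);
            handler.ignorableWhitespace(newline, 0, 1);
            handler.characters(more, 0, more.length);
            return handler.toString();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(replay(new CollectingHandler()));
        System.out.println("---");
        System.out.println(replay(new SkippingHandler()));
    }
}
```

In this simulated stream, the newline inside the text survives in both handlers, but the skipping handler also joins "Line two" and "Line three" with nothing in between. If the same thing happens with real Tika output, it may be preferable to write a single newline per run of ignorable whitespace rather than dropping all of it.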
