I have an HTML file:
<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"> <div>Test message.</div> <div> </div> <div>More content here...</div> <div> </div> <div>Best regards,</div> <div>Mr. Crowley</div></div></body></html>
I am trying to get the contents of the file above using Apache Tika ...
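import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

final InputStream input = new FileInputStream("file.html");
final ContentHandler handler = new BodyContentHandler();
final Metadata metadata = new Metadata();
final HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);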
... and everything is fine, except for the extra lines:
Test message. More content here... Best regards, Mr. Crowley <and 3 empty lines here>
Can this behavior be avoided? Is it possible to get a result closer to what I expect, such as the following?
Test message. More content here... Best regards, Mr. Crowley
Unfortunately, post-processing the string with code constructs such as
plainText = plainText.replaceAll("(\n)+", "\n");
is not an option for me here. Also, I cannot change the structure of my HTML file.
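For reference, here is a minimal, self-contained sketch of what that replaceAll post-processing does, so it is clear which effect I would like to get directly from the parser. The class name CollapseNewlines, the helper collapse, and the sample string are hypothetical; the string is only an assumed stand-in for the extracted text, and the real Tika output may contain other whitespace between the line breaks.

```java
public class CollapseNewlines {

    // Collapses every run of one or more consecutive '\n' characters into a single '\n'.
    static String collapse(String text) {
        return text.replaceAll("(\n)+", "\n");
    }

    public static void main(String[] args) {
        // Assumed stand-in for the text returned by handler.toString().
        String plainText =
                "Test message.\n\nMore content here...\n\nBest regards,\nMr. Crowley\n\n\n";
        System.out.print(collapse(plainText));
    }
}
```

Note that this pattern only matches newlines that are directly adjacent, and it still leaves one trailing newline at the end of the text.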