Convert document to pdf using Apache POI

I am trying to convert a document to PDF using Apache POI, but the resulting PDF contains only text. None of the formatting, such as images, tables, or alignment, is carried over.

How can I convert a doc to PDF and keep all the formatting, such as tables, images, and alignment?

Here is my code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import com.lowagie.text.Document;
    import com.lowagie.text.DocumentException;
    import com.lowagie.text.Paragraph;
    import com.lowagie.text.pdf.PdfWriter;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.hwpf.usermodel.Range;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    public class demo {
        public static void main(String[] args) {
            POIFSFileSystem fs = null;
            Document document = new Document();

            try {
                System.out.println("Starting the test");
                fs = new POIFSFileSystem(new FileInputStream("Resume.doc"));

                HWPFDocument doc = new HWPFDocument(fs);
                WordExtractor we = new WordExtractor(doc);

                OutputStream file = new FileOutputStream(new File("test.pdf"));
                PdfWriter writer = PdfWriter.getInstance(document, file);

                Range range = doc.getRange();
                document.open();
                writer.setPageEmpty(true);
                document.newPage();
                writer.setPageEmpty(true);

                String[] paragraphs = we.getParagraphText();
                for (int i = 0; i < paragraphs.length; i++) {
                    org.apache.poi.hwpf.usermodel.Paragraph pr = range.getParagraph(i);
                    paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
                    System.out.println("Length:" + paragraphs[i].length());
                    System.out.println("Paragraph" + i + ": " + paragraphs[i].toString());

                    // add the paragraph to the document
                    document.add(new Paragraph(paragraphs[i]));
                }

                System.out.println("Document testing completed");
            } catch (Exception e) {
                System.out.println("Exception during test");
                e.printStackTrace();
            } finally {
                // close the document
                document.close();
            }
        }
    }
+5
3 answers

The task is to convert a doc to PDF with all formatting, such as tables, images, and alignment.

Creating your own converter class

Apache POI already has the WordToXxxConverter classes, namely WordToFoConverter, WordToHtmlConverter, and WordToTextConverter. The last one is most likely too basic to serve as a model for your requirements, but the first two are adequate.

All these converter classes are derived from the common AbstractWordConverter base class, which provides the basic structure for Word conversion classes. In addition, each of these classes uses a corresponding *DocumentFacade class, which wraps the creation of the specific target (or some intermediate) format: FoDocumentFacade, HtmlDocumentFacade, or TextDocumentFacade.

To convert a doc to PDF with all formatting, such as tables, images, and alignment, you should therefore also derive your converter class from AbstractWordConverter and, when implementing the abstract methods, take inspiration from the three existing implementation classes. Just as in the other converter classes, concentrating most of the PDF-library-specific code in a PdfDocumentFacade class seems like a good idea.

If you want to start simple and add more complex details later, you can begin by borrowing mostly from the WordToTextConverter implementation and, once that works at least at a proof-of-concept level, extend the functionality to cover more and more formatting information.

Unfortunately, this converter infrastructure is tied to the DOM in several places: the AbstractWordConverter callbacks expect and pass along DOM elements as indicators of the current context in the target document. At first glance, nothing seems to require that this context be a DOM element, so you might get away with copying this base class and replacing those DOM element parameters with a more appropriate type, or even with a generic type parameter.
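The converter-plus-facade split with a generic context type could be sketched roughly as follows. This is only an illustration of the design idea, not Apache POI code: DocumentFacade, SimpleConverter, and TextFacade are hypothetical names, the input is passed in as plain strings instead of HWPF paragraphs and tables, and the facade renders plain text so that the sketch runs without a PDF library. In a real PdfDocumentFacade the context type would presumably be an object of the PDF library (for example a table or cell).

```java
import java.util.Arrays;
import java.util.List;

/** Entry point of the sketch; all types in this file are hypothetical. */
final class ConverterSketch {
    public static void main(String[] args) {
        TextFacade facade = new TextFacade();
        SimpleConverter<StringBuilder> converter = new SimpleConverter<>(facade);
        converter.processParagraph("Resume", "center");
        converter.processTable(Arrays.asList(
                Arrays.asList("Year", "Employer"),
                Arrays.asList("2012", "ACME")));
        System.out.print(facade.root());
    }
}

/** Wraps the target format; C is the context type handed back to the callbacks. */
interface DocumentFacade<C> {
    C root();
    void addParagraph(C parent, String text, String alignment);
    C beginTable(C parent);
    void addRow(C table, List<String> cells);
}

/** Walks the input and forwards to the facade, passing the context along. */
final class SimpleConverter<C> {
    private final DocumentFacade<C> facade;

    SimpleConverter(DocumentFacade<C> facade) {
        this.facade = facade;
    }

    void processParagraph(String text, String alignment) {
        facade.addParagraph(facade.root(), text, alignment);
    }

    void processTable(List<List<String>> rows) {
        C table = facade.beginTable(facade.root());
        for (List<String> cells : rows) {
            facade.addRow(table, cells);
        }
    }
}

/** Stand-in facade that writes plain text into a StringBuilder. */
final class TextFacade implements DocumentFacade<StringBuilder> {
    private final StringBuilder out = new StringBuilder();

    @Override
    public StringBuilder root() {
        return out;
    }

    @Override
    public void addParagraph(StringBuilder parent, String text, String alignment) {
        parent.append('[').append(alignment).append("] ").append(text).append('\n');
    }

    @Override
    public StringBuilder beginTable(StringBuilder parent) {
        parent.append("[table]\n");
        return parent;
    }

    @Override
    public void addRow(StringBuilder table, List<String> cells) {
        table.append(String.join(" | ", cells)).append('\n');
    }
}
```

Because the converter only ever talks to the facade, swapping TextFacade for a PDF-backed implementation would not require changes to the traversal code.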

Using existing Word-to-XXX converters in combination with existing XXX-to-PDF converters

If this seems too complicated or time-consuming for your resources, you can try a different approach: you can try using the output of one of the existing converters mentioned above as input for another conversion to Pdf.

Using existing conversion classes will get you results sooner, but multi-step conversions are generally more lossy than single-step ones. The decision is up to you.

In the code you posted in your question, you used iText classes. iText supports conversion from HTML to PDF, with certain restrictions, using the XMLWorker provided in the iText XML Worker subproject. Older versions of iText also had the now obsolete HTMLWorker class. Thus, using WordToHtmlConverter in combination with iText XMLWorker may be an option for you.

Alternatively, Apache also provides XSL-FO processing to PDF (Apache FOP). Applying this to the output of WordToFoConverter may also be an option.
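Both two-stage routes pass through a W3C DOM document: WordToHtmlConverter and WordToFoConverter are constructed with an org.w3c.dom.Document and build their output into it. The glue between the stages, serializing that DOM to well-formed markup for the second converter, needs only the JDK and might look like the sketch below. The class HtmlStage and the hand-built sample DOM are assumptions made so the example runs without POI or iText on the classpath; the library calls that would replace them are shown only in comments.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper for the step between the two conversions. */
final class HtmlStage {

    /** Serializes a DOM tree as well-formed XML text without an XML declaration. */
    static String toXhtml(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    /** Hand-built stand-in for the DOM a Word converter would produce. */
    static Document sampleDom() throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = dom.createElement("html");
        Element body = dom.createElement("body");
        Element p = dom.createElement("p");
        p.setTextContent("Hello");
        body.appendChild(p);
        html.appendChild(body);
        dom.appendChild(html);
        return dom;
    }

    public static void main(String[] args) throws Exception {
        // Stage 1, with poi-scratchpad available, would replace sampleDom():
        //   WordToHtmlConverter converter = new WordToHtmlConverter(emptyDom);
        //   converter.processDocument(new HWPFDocument(new FileInputStream("Resume.doc")));
        //   Document dom = converter.getDocument();
        String xhtml = toXhtml(sampleDom());
        System.out.println(xhtml);
        // Stage 2 would feed the XHTML to an HTML-to-PDF library, for example
        // iText's XMLWorkerHelper.getInstance().parseXHtml(writer, document, inputStream).
    }
}
```

For the XSL-FO route the same serialization step applies, with WordToFoConverter producing the DOM and the FO processor consuming the result.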

+8

As an alternative to POI (but still in the Java domain), you can consider docx4j (which I maintain).

For docx files, docx4j can convert to PDF by first converting to FO and then using FOP to convert to PDF.

For old binary doc files (as well as docx files) we have a high-performance commercial solution. You can try it at http://converter-eval.plutext.com/plutext/converter or get more information at http://www.plutext.com/m/index.php/products-docx-to-pdf.html

+2

I used OpenOffice / LibreOffice to export to PDF. It has automation support; for example,

 unoconv -vvv --timeout=10 --doctype=document --output=result.pdf result.docx 

converts a document to pdf.
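Since the question is about Java, the same command could be launched from Java with ProcessBuilder. This is a minimal sketch that assumes unoconv is installed and on the PATH; the class name UnoconvRunner and its methods are hypothetical, and only flags that appear in the command above are used.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper around the unoconv command line shown above. */
final class UnoconvRunner {

    /** Builds the argument list; kept separate so it can be inspected without running anything. */
    static List<String> buildCommand(String input, String output, int timeoutSeconds) {
        List<String> command = new ArrayList<>();
        command.add("unoconv");
        command.add("--timeout=" + timeoutSeconds);
        command.add("--doctype=document");
        command.add("--output=" + output);
        command.add(input);
        return command;
    }

    /** Runs unoconv and returns its exit code; assumes unoconv is on the PATH. */
    static int convert(String input, String output) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, output, 10))
                .inheritIO() // show unoconv's own messages on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            System.out.println("unoconv exit code: " + convert(args[0], args[1]));
        } else {
            // Without arguments, only print the command that would be run.
            System.out.println(String.join(" ", buildCommand("result.docx", "result.pdf", 10)));
        }
    }
}
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.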

+1

Source: https://habr.com/ru/post/1493112/

