Create PDF from Word (DOC) using Apache POI and iText in JAVA

I am trying to create a PDF document from a *.doc document. So far, thanks to Stack Overflow, I have had some success, but there are still problems.

My sample code creates the PDF without formatting or images, only the text. The document contains blank spaces and images that do not make it into the PDF.

Here is the code:

    in = new FileInputStream(sourceFile.getAbsolutePath());
    out = new FileOutputStream(outputFile);

    WordExtractor wd = new WordExtractor(in);
    String text = wd.getText();

    Document pdf = new Document(PageSize.A4);
    PdfWriter.getInstance(pdf, out);
    pdf.open();
    pdf.add(new Paragraph(text));
+6
5 answers

docx4j includes code for creating a PDF from docx using iText. It can also use POI to convert a .doc document to docx.

There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.

If this is an option, you would be much better off using docx4j to convert docx to PDF via XSL-FO and FOP.

Use it like this:

    wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

    // Set up font mapper
    Mapper fontMapper = new IdentityPlusMapper();
    wordMLPackage.setFontMapper(fontMapper);

    // Example of mapping missing font Algerian to installed font Comic Sans MS
    PhysicalFont font = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");
    fontMapper.getFontMappings().put("Algerian", font);

    org.docx4j.convert.out.pdf.PdfConversion c
        = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
        // = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);

    OutputStream os = new java.io.FileOutputStream(inputfilepath + ".pdf");
    c.output(os);

Update July 2016

As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for converting docx to PDF. You can try the online demo at converter-eval.plutext.com

If you want to use the existing docx to XSL-FO to PDF approach (or another output target supported by Apache FOP), then simply add the docx4j-export-FO jar to your classpath.

Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.

The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-additions/PdfViaIText/

+11

WordExtractor just captures plain text, nothing more. That's why all you see is plain text.

What you need to do is get each paragraph separately, then grab each run, get formatting, and generate the PDF equivalent.

One option might be to find some code that turns XHTML into PDF. Then use Apache Tika to turn your Word document into XHTML (it uses POI under the hood and handles all the formatting for you), and go from XHTML to PDF.

Otherwise, if you do it yourself, look at the code in Apache Tika for parsing Word files. It is a great example of how to get at the images, formatting, styles, etc.

+2

I have successfully used Apache FOP to convert a WordML document to PDF. WordML is the format Word 2003 uses to save a document as XML. XSLT stylesheets can be found on the Internet to transform this XML into XSL-FO, which in turn can be rendered by FOP to PDF (among other outputs).
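
To make the XSLT step a bit more concrete, here is a minimal sketch that uses only the JDK's built-in javax.xml.transform. The inline stylesheet is a hypothetical toy written for this illustration: it turns every w:p into one fo:block and ignores all formatting, so it is not one of the WordML stylesheets you can find online. The class and method names are made up as well. The resulting XSL-FO would then be handed to FOP, which is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class WordMlToFoSketch {

    // Hypothetical toy stylesheet: one fo:block per WordML paragraph (w:p),
    // text taken from its runs (w:r/w:t). All formatting is ignored.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
      + " xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:template match='/'>"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name='A4'"
      + " page-height='297mm' page-width='210mm' margin='20mm'>"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='A4'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:for-each select='//w:p'>"
      + "<fo:block>"
      + "<xsl:for-each select='w:r/w:t'><xsl:value-of select='.'/></xsl:for-each>"
      + "</fo:block>"
      + "</xsl:for-each>"
      + "</fo:flow>"
      + "</fo:page-sequence>"
      + "</fo:root>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a WordML string and returns the XSL-FO as a string.
    static String toXslFo(String wordMl) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(wordMl)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Minimal hand-written WordML input, just enough for the toy stylesheet.
        String wordMl =
            "<w:wordDocument"
          + " xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'>"
          + "<w:body><w:p><w:r><w:t>Hello from WordML</w:t></w:r></w:p></w:body>"
          + "</w:wordDocument>";
        System.out.println(toXslFo(wordMl));
    }
}
```

A real stylesheet has to map fonts, tables, images and so on, which is where the ones available online earn their keep.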

This is not so different from the solution plutext proposed, except that it does not read .doc documents, whereas docx4j apparently does. If your requirements are flexible enough to accept WordML documents as input, it might be worth exploring.

Good luck with your project! Wim

+1

Use OpenOffice/LibreOffice and JODConverter. This also mostly works for converting .doc to .docx. There are problems with graphics that I have not worked out yet, though.

    private static void transformDocXToPDFUsingJOD(File in, File out) {
        OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
        DocumentFormat pdf = converter.getFormatRegistry().getFormatByExtension("pdf");
        converter.convert(in, out, pdf);
    }

    private static OfficeManager officeManager;

    @BeforeClass
    public static void setupStatic() throws IOException {
        /*officeManager = new DefaultOfficeManagerConfiguration()
            .setOfficeHome("C:/Program Files/LibreOffice 3.6")
            .buildOfficeManager();
        */
        officeManager = new ExternalOfficeManagerConfiguration()
            .setConnectOnStart(true)
            .setPortNumber(8100)
            .buildOfficeManager();
        officeManager.start();
    }

    @AfterClass
    public static void shutdownStatic() throws IOException {
        officeManager.stop();
    }

You need to be running LibreOffice as a server to make this work. From the command line you can do this using:

 "C:\Program Files\LibreOffice 3.6\program\soffice.exe" -accept="socket,host=0.0.0.0,port=8100;urp;LibreOffice.ServiceManager" -headless -nodefault -nofirststartwizard -nolockcheck -nologo -norestore 
+1

Another option I've come across lately is to use the OpenOffice (or LibreOffice) API (see here). I have not been able to get into it myself, but it should be able to open documents in various formats and output them as PDF. If you look into this, let me know how it works!
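
This is not the API itself, but a simpler route to the same engine is LibreOffice's own command line, which has a --convert-to switch. Below is a rough sketch that drives it from Java with ProcessBuilder. The class and method names are made up for illustration, "report.doc" and "out" are hypothetical paths, and the sketch assumes a soffice executable can be found on the PATH.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class SofficeConvertSketch {

    // Builds: soffice --headless --convert-to pdf --outdir <outDir> <input>
    static List<String> convertCommand(String sofficePath, File input, File outDir) {
        return Arrays.asList(
            sofficePath,
            "--headless",
            "--convert-to", "pdf",
            "--outdir", outDir.getPath(),
            input.getPath());
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: "soffice" is on the PATH; both paths are hypothetical.
        List<String> cmd =
            convertCommand("soffice", new File("report.doc"), new File("out"));
        try {
            // ProcessBuilder passes each list element as one argument,
            // so paths with spaces need no extra quoting.
            Process process = new ProcessBuilder(cmd).inheritIO().start();
            System.out.println("soffice exited with code " + process.waitFor());
        } catch (IOException e) {
            System.out.println("Could not start soffice: " + e.getMessage());
        }
    }
}
```

Unlike the API route, this starts a fresh office process for every conversion, so it is probably only reasonable for occasional or batch use.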

0

Source: https://habr.com/ru/post/888546/

