How can I convert a PDF file to a text file using Java

How do I convert a PDF file to a text file using Java?

And is it as easy as it seems?

+4
2 answers

Try Apache PDFBox:

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    // Uses the PDFBox 1.x API; in PDFBox 2.x, PDFTextStripper lives in org.apache.pdfbox.text.
    public class PDFTextReader {

        // Placeholder input path; replace it with the PDF you want to convert.
        static final String PDF_FILE_PATH = "/sample/filename.pdf";

        // Returns the text of the PDF, or null if it could not be read or parsed.
        static String pdftoText(String fileName) {
            PDFParser parser;
            String parsedText = null;
            PDDocument pdDoc = null;
            COSDocument cosDoc = null;
            File file = new File(fileName);
            if (!file.isFile()) {
                System.err.println("File " + fileName + " does not exist.");
                return null;
            }
            try {
                parser = new PDFParser(new FileInputStream(file));
            } catch (IOException e) {
                System.err.println("Unable to open PDF Parser. " + e.getMessage());
                return null;
            }
            try {
                parser.parse();
                cosDoc = parser.getDocument();
                PDFTextStripper pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                System.err.println("An exception occurred in parsing the PDF Document. " + e.getMessage());
            } finally {
                try {
                    if (cosDoc != null) cosDoc.close();
                    if (pdDoc != null) pdDoc.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            return parsedText;
        }

        public static void main(String[] args) {
            String content = pdftoText(PDF_FILE_PATH);
            if (content == null) {
                return; // pdftoText has already reported the problem
            }
            File file = new File("/sample/filename.txt");
            // FileWriter creates the file if it does not exist yet
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(file))) {
                bw.write(content);
                System.out.println("Done");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
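One caveat about the output step: FileWriter encodes with the platform default charset, so non-ASCII characters extracted from a PDF may come out differently on different machines. Below is a minimal sketch of the write step with an explicit UTF-8 encoding, using only the JDK. The class name TextFileWriter and the method writeText are made up for illustration and are not part of PDFBox; the string literal stands in for whatever pdftoText returns.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of PDFBox): writes already-extracted text to a file as UTF-8.
final class TextFileWriter {

    static void writeText(Path target, String text) throws IOException {
        if (text == null) {
            // pdftoText in the answer above returns null when extraction fails
            throw new IllegalArgumentException("no text to write");
        }
        // newBufferedWriter creates the file if needed and truncates an existing one.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // A literal stands in for the extracted PDF text; the temp file is just a demo target.
        Path out = Files.createTempFile("pdf-text", ".txt");
        writeText(out, "Gr\u00fc\u00dfe from a PDF");
        System.out.println("Wrote " + out);
    }
}
```

In the answer's main method this would replace the FileWriter/BufferedWriter lines, for example writeText(Paths.get("/sample/filename.txt"), content).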
+7

I looked into this question in depth and found that, for correct results, you cannot avoid using MS Word. Even funded projects such as LibreOffice struggle with proper Word conversion, because the Word format is quite complex and changes from version to version. Only MS Word keeps track of this.

For this reason, I implemented documents4j, which delegates conversions to MS Word through a Java API. It also lets you move conversions to another machine that you reach through a REST API. You can find detailed information on the GitHub page.

+4

Source: https://habr.com/ru/post/1494645/

