I use the iText PDF library to convert PDF to text.
Below is my Java code for converting a PDF to a text file.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PdfConverter {

    /** The original PDF that will be parsed. */
    public static final String pdfFileName = "jdbc_tutorial.pdf";

    /** The resulting text file. */
    public static final String RESULT = "preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException if the PDF cannot be read or the text file cannot be written
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        // iText page numbers start at 1
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
            System.out.println(strategy.getResultantText());
        }
        out.flush();
        out.close();
        reader.close();
    }

    /**
     * Main method.
     * @param args no arguments needed
     * @throws IOException if parsing fails
     */
    public static void main(String[] args) throws IOException {
        new PdfConverter().parsePdf(pdfFileName, RESULT);
    }
}
The above code extracts the text of the PDF. But my requirement is to ignore the header and footer and extract only the body content of each page.
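It depends on whether the PDF is tagged (that is, whether its structure marks content such as headers and footers). If it is, look at the iText sample ParseTaggedPdf. If it is not, look at the sample ExtractPageContentArea instead, because ParseTaggedPdf relies on tags that an untagged file does not have.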
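Other libraries can extract text as well, for example the Apache APIs PdfBox and Tika, or PDFTextStream, if you are not tied to iText. In PdfBox, the relevant classes are PDFTextStripperByArea and PDFTextStripper; see their JavaDoc for details.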
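One detail to watch when trying PDFTextStripperByArea: as far as I know, it matches text against `java.awt.geom.Rectangle2D` regions whose origin is the upper-left corner of the page, whereas PDF user space (and an iText `Rectangle`) puts the origin at the lower-left. Below is a minimal sketch of that conversion using only the JDK. `RegionConverter` and `toTopLeft` are hypothetical names, and the page height of 842 pt assumes an A4 page.

```java
import java.awt.geom.Rectangle2D;

public class RegionConverter {

    /**
     * Converts a lower-left-origin box (llx, lly, urx, ury) into an
     * upper-left-origin rectangle (x, y, width, height) for a page of the
     * given height. All values are in points.
     */
    static Rectangle2D toTopLeft(double llx, double lly, double urx, double ury,
                                 double pageHeight) {
        double width = urx - llx;
        double height = ury - lly;
        double top = pageHeight - ury; // distance from the top edge of the page
        return new Rectangle2D.Double(llx, top, width, height);
    }

    public static void main(String[] args) {
        // Sample body area on an A4 page (842 pt high), assumed for illustration.
        Rectangle2D region = toTopLeft(70, 80, 420, 500, 842);
        System.out.println(region.getX() + " " + region.getY() + " "
                + region.getWidth() + " " + region.getHeight());
        // prints 70.0 342.0 350.0 420.0
    }
}
```

The resulting rectangle could then be registered with `addRegion(String, Rectangle2D)` before calling `extractRegions(page)`; check the JavaDoc of the PdfBox version you use to confirm the coordinate convention.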
For iText, see http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/
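The code below restricts extraction to a rectangular region of each page, so text outside it (such as the header and footer) is skipped.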
// Needs com.itextpdf.text.Rectangle, com.itextpdf.text.pdf.PdfReader,
// com.itextpdf.text.pdf.parser.* and java.io.*
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));

// Creating the rectangle (llx, lly, urx, ury)
Rectangle rect = new Rectangle(70, 80, 420, 500);

// Creating a filter based on the rectangle
RenderFilter filter = new RegionTextRenderFilter(rect);

TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Setting the filter on the text extraction strategy
    strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
    out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
Adjust the rectangle coordinates to the content area of your own PDF.
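To choose the rectangle, it can help to derive it from the page size and the header and footer heights instead of hard-coding numbers. The sketch below is plain Java for illustration only, not iText code: `BodyRegion`, `Chunk`, and the sample measurements are hypothetical. It assumes the usual PDF convention of an origin at the lower-left corner with units in points, and it follows the argument order of iText 5's `Rectangle(llx, lly, urx, ury)`. The point-in-rectangle test is a simplification of what a region filter does.

```java
import java.util.ArrayList;
import java.util.List;

public class BodyRegion {

    /** A piece of text with the position where it starts (hypothetical type). */
    static final class Chunk {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Returns {llx, lly, urx, ury} for the page area between footer and header.
     * The origin is the lower-left corner, so the footer sits at low y values.
     */
    static float[] bodyRect(float pageWidth, float pageHeight,
                            float headerHeight, float footerHeight) {
        return new float[] {0f, footerHeight, pageWidth, pageHeight - headerHeight};
    }

    /** Keeps only the chunks whose position lies inside the rectangle. */
    static List<String> filter(List<Chunk> chunks, float[] rect) {
        List<String> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            boolean inside = c.x >= rect[0] && c.x <= rect[2]
                    && c.y >= rect[1] && c.y <= rect[3];
            if (inside) {
                kept.add(c.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A4 page in points (595 x 842) with an assumed 60 pt header and 50 pt footer.
        float[] rect = bodyRect(595f, 842f, 60f, 50f);

        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk(72f, 810f, "Chapter title (header)"));
        page.add(new Chunk(72f, 400f, "Body paragraph"));
        page.add(new Chunk(290f, 30f, "Page 3 (footer)"));

        System.out.println(filter(page, rect)); // prints [Body paragraph]
    }
}
```

The four computed values can be passed to `new Rectangle(rect[0], rect[1], rect[2], rect[3])` in the iText snippet, with the header and footer heights measured from your own document.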
Source: https://habr.com/ru/post/1570746/