Read a PDF upload stream one page at a time with Java

I am trying to read a PDF in a J2EE application.

For a web application, I have to store PDF documents on disk. To make searching easier, I want to build an inverted index of the text inside each document, if it has been OCRed.

In the PDFBox library, you can create a PDDocument object that contains the entire PDF file. However, to save memory and improve overall performance, I would rather process the document as a stream and read only one page into a buffer at a time.

Is it possible to read a PDF file from a stream page by page, or even one line at a time?

0
4 answers

For a generic PDF, you cannot determine where one page ends and the next begins by using PDFBox.

If your concern is resource usage, I suggest parsing the PDF into a COSDocument and extracting the parsed objects with .getObjects(), which returns a java.util.List. That should fit into whatever scarce resources you have.

Note that you can easily convert your parsed PDFs to Lucene indexes through the PDFBox API.

Also, before venturing into the land of optimizations, make sure you really need them. PDFBox can hold fairly large PDF documents in memory without much effort.

To parse a PDF document from an InputStream, see the COSDocument class.

To write Lucene indexes, see the LucenePDFDocument class.

For in-memory representations of COSDocuments, see FDFDocument.

+1

In versions 2.0.*, open the PDF as follows:

PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly()); 

This configures PDFBox to buffer in temporary files only (no main memory), with no size limit.

This was answered here.
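Once the document is opened this way, the indexing part of the question can be done one page at a time. Below is a minimal sketch of an inverted index (term to page numbers) that consumes page text lazily. PageInvertedIndex and its methods are hypothetical names made up for illustration, not part of PDFBox, and the page texts in main are stand-in strings. With PDFBox 2.0.x the text of a single page could presumably be obtained from PDFTextStripper by calling setStartPage(n) and setEndPage(n) before getText(doc); that wiring is left out so the sketch depends only on the JDK.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: builds an inverted index
// (term -> 1-based page numbers) from page texts supplied one at a time.
class PageInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Tokenizes one page on non-letter/non-digit characters and records
    // the page number under each lower-cased term.
    void addPage(int pageNumber, String pageText) {
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    // Pulls pages from the iterator one by one, so only the text of the
    // current page has to be in memory (the index itself still grows).
    void addAll(Iterator<String> pageTexts) {
        int pageNumber = 1;
        while (pageTexts.hasNext()) {
            addPage(pageNumber++, pageTexts.next());
        }
    }

    // Returns the pages containing the term, or an empty set if there are none.
    SortedSet<Integer> pagesContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        // Assumption: these strings stand in for text extracted page by page.
        List<String> pages = Arrays.asList(
                "Invoice for PDF storage",
                "Storage terms and OCR notes");
        PageInvertedIndex index = new PageInvertedIndex();
        index.addAll(pages.iterator());
        System.out.println(index.pagesContaining("storage")); // prints [1, 2]
    }
}
```

Taking an Iterator rather than a List is the point of the design: the caller can extract page n only when the index asks for it, so the extraction side never needs to hold more than one page of text.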

+1

Check out the PDF Renderer Java library. I tried it myself, and it seems a lot faster than PDFBox. However, I have not tried to extract OCR text with it.

Here is an example copied from the link above that shows how to draw a PDF page to an image:

    import java.awt.Image;
    import java.awt.Rectangle;
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import com.sun.pdfview.PDFFile;
    import com.sun.pdfview.PDFPage;

    // map the file into memory and hand the buffer to PDF Renderer
    File file = new File("test.pdf");
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel channel = raf.getChannel();
    ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    PDFFile pdffile = new PDFFile(buf);

    // draw the first page to an image
    PDFPage page = pdffile.getPage(0);

    // get the width and height for the doc at the default zoom
    Rectangle rect = new Rectangle(0, 0,
            (int) page.getBBox().getWidth(),
            (int) page.getBBox().getHeight());

    // generate the image
    Image img = page.getImage(
            rect.width, rect.height, // width & height
            rect,                    // clip rect
            null,                    // null for the ImageObserver
            true,                    // fill background with white
            true                     // block until drawing is done
    );
-1

I would suggest reading the file byte by byte and looking for page breaks. Reading line by line is harder because of possible PDF formatting problems.
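A rough sketch of what byte-by-byte scanning could look like, with strong caveats. It reads an InputStream one byte at a time and counts /Type /Page entries (page dictionaries), ignoring /Type /Pages. PdfPageMarkerScanner is a hypothetical class written for illustration. It only sees page dictionaries stored as plain text: PDFs that put them in compressed object streams (PDF 1.5 and later) would be undercounted, binary stream data could produce false matches, and a page's content and resources usually live in other objects, so this does not split a file into pages.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: counts uncompressed "/Type /Page" entries
// while holding only a few bytes of state in memory.
class PdfPageMarkerScanner {
    private static final int MAX_NAME = 16; // longer names cannot be "Type" or "Page"

    private static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static boolean isDelimiter(int b) {
        return "()<>[]{}/%".indexOf(b) >= 0;
    }

    static int countPageObjects(InputStream in) throws IOException {
        int count = 0;
        boolean inName = false;    // currently inside a /Name token
        boolean afterType = false; // the previous token was /Type
        StringBuilder name = new StringBuilder();
        while (true) {
            int b = in.read();
            boolean endsToken = b == -1 || isWhitespace(b) || isDelimiter(b);
            if (inName && endsToken) {
                String n = name.toString();
                if (afterType && n.equals("Page")) {
                    count++; // "/Type" directly followed by "/Page"
                }
                afterType = n.equals("Type");
                inName = false;
            } else if (inName && name.length() < MAX_NAME) {
                name.append((char) b);
            }
            if (b == -1) {
                break;
            }
            if (b == '/') {
                inName = true; // a new name token starts here
                name.setLength(0);
            } else if (!inName && !isWhitespace(b)) {
                afterType = false; // something other than a name followed /Type
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a hand-written fragment standing in for an uncompressed PDF body.
        String sample = "<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>\n"
                + "<< /Type /Page /Parent 1 0 R >>\n"
                + "<</Type/Page/Parent 1 0 R>>\n";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(countPageObjects(in)); // prints 2
    }
}
```

For real input, wrapping the upload stream in a BufferedInputStream keeps the per-byte read() calls cheap.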

-2

Source: https://habr.com/ru/post/919534/