For a generic PDF, you cannot find out where one page ends and the next begins using PDFBox.
If your concern is resource usage, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from it using .getObjects(), which gives you a java.util.List. That should be easy to fit into whatever scarce resources you have.
Note that you can easily convert your parsed PDFs into Lucene indexes through the PDFBox API.
Also, before you venture into the land of optimizations, make sure you really need them. PDFBox can build an in-memory representation of fairly large PDF documents without much effort.
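One rough way to check whether you need to optimize at all is to measure how much the heap grows while the document is loaded. The following is only a sketch using the standard library: `HeapCheck`, `Measured`, and `measure` are hypothetical names made up for this example, not part of PDFBox, and the byte array in `main` is just a stand-in for your own loading code. The numbers are approximate, because `System.gc()` is only a hint and the JVM may collect garbage at any time.

```java
import java.util.function.Supplier;

public class HeapCheck {

    /** Holds the loaded value and the approximate heap growth in bytes. */
    public static final class Measured<T> {
        public final T value;
        public final long heapDeltaBytes;

        Measured(T value, long heapDeltaBytes) {
            this.value = value;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    /** Heap currently in use, as reported by the JVM. */
    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the loader and reports roughly how much heap it consumed. */
    public static <T> Measured<T> measure(Supplier<T> loader) {
        System.gc(); // only a hint; makes the "before" reading a bit less noisy
        long before = usedHeap();
        T value = loader.get();
        long after = usedHeap();
        return new Measured<>(value, after - before);
    }

    public static void main(String[] args) {
        // Stand-in for loading a document: allocate a 16 MiB buffer.
        Measured<byte[]> m = measure(() -> new byte[16 * 1024 * 1024]);
        System.out.println("Approximate heap growth: "
                + (m.heapDeltaBytes / 1024) + " KiB");
    }
}
```

In practice you would replace the lambda with whatever code loads your PDF and compare the reported growth with the memory you actually have available before deciding to optimize.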
To parse a PDF document from an InputStream, see the COSDocument class.
To write Lucene indexes, see the LucenePDFDocument class.
For in-memory representations of COSDocuments, see FDFDocument.