Extract text between two bookmarks using Apache PDFBox

Question


I am using Apache PDFBox to read a PDF document whose structure is defined by a hierarchy of bookmarks. The hierarchy is a tree, with content only at the leaf level.

I tried to extract the text between two leaf-level bookmarks using the following calls:

Stripper.setStartBookmark(), Stripper.setEndBookmark(), Stripper.writeText()

Instead of the text between the two bookmarks, this returns the text of the entire page. In short, my problem is similar to the one described in this thread.

Is there a way to extract content between two bookmarks?

If so, what should be the change in my code?


java pdf pdfbox

Shriram Kalpathy Mohan Mar 6 '12 at 7:21


1 answer

maffo · Answer 1 · 2013-02-04T07:30:57+0000

I suspect that your bookmarks do not contain the data needed for this.

It appears that the bookmarks you are using only point to the page where your content begins, and not to a location on that page.

Here is an example of a bookmark containing location data:

 <Title Action="GoTo" Style="bold" Page="2 FitH 518"> Title Name </Title>
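To check whether a given bookmark carries location data, one option is to split the Page attribute into its parts. The following is only a minimal sketch, assuming the attribute has the form "page fitType top" as in the example above. The class BookmarkDestination and its methods are hypothetical names made up for illustration; they are not part of PDFBox.

```java
import java.util.OptionalDouble;

/** Hypothetical helper: holds the parts of a destination string such as "2 FitH 518". */
final class BookmarkDestination {
    final int page;            // page number as written in the bookmark
    final String fitType;      // e.g. "FitH", or "" when only a page is given
    final OptionalDouble top;  // vertical offset on the page, if present

    private BookmarkDestination(int page, String fitType, OptionalDouble top) {
        this.page = page;
        this.fitType = fitType;
        this.top = top;
    }

    /** Parses "page [fitType [top]]"; assumes whitespace-separated fields. */
    static BookmarkDestination parse(String pageAttribute) {
        String[] parts = pageAttribute.trim().split("\\s+");
        int page = Integer.parseInt(parts[0]);
        String fitType = parts.length > 1 ? parts[1] : "";
        OptionalDouble top = OptionalDouble.empty();
        // Only FitH is handled here: the number after it is the top coordinate.
        if (fitType.equals("FitH") && parts.length > 2) {
            top = OptionalDouble.of(Double.parseDouble(parts[2]));
        }
        return new BookmarkDestination(page, fitType, top);
    }

    /** True when the bookmark names a position on the page, not just the page. */
    boolean hasLocation() {
        return top.isPresent();
    }

    public static void main(String[] args) {
        BookmarkDestination precise = parse("2 FitH 518");
        BookmarkDestination pageOnly = parse("2");
        System.out.println("precise has location: " + precise.hasLocation());
        System.out.println("page-only has location: " + pageOnly.hasLocation());
    }
}
```

If hasLocation() is false for your start or end bookmark, the bookmark only identifies a page, which would be consistent with getting the whole page back.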
