Here's the source for ReadPages (C# here, from iTextSharp; the Java original is essentially identical):

    protected internal void ReadPages() {
        catalog = trailer.GetAsDict(PdfName.ROOT);
        rootPages = catalog.GetAsDict(PdfName.PAGES);
        pageRefs = new PageRefs(this);
    }

`trailer`, `catalog`, `rootPages`, and `pageRefs` are all PdfReader member variables.
If the trailer or the root/catalog object of the PDF is simply missing, your PDF is REALLY broken. More likely, the xref table is a little off and the objects in question are not exactly where they are supposed to be (which is bad, but recoverable).
HOWEVER, when PdfReader first opens a PDF file, it parses ALL the objects in the file and converts them to the appropriate PdfObject-derived classes.
What it does not do is check that the object number declared in the xref table and the object number read from the file itself actually match. A mismatch is very unlikely, but possible. Bad software could write its PDF objects in the wrong order yet store the correct byte offsets in the xref table. Software that overrides the object number from the xref table with the number found at that particular byte offset in the file would be fine.
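To make that check concrete, here is a minimal sketch of what comparing an xref entry against the file could look like. It is not iText code: `XrefObjectCheck`, `objectNumberAt`, and `xrefEntryMatches` are hypothetical names, and it assumes the offset points directly at the first digit of an `N G obj` header.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class XrefObjectCheck {
    // An indirect object starts with "<object number> <generation> obj".
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(\\d{1,9})\\s+(\\d{1,5})\\s+obj");

    /** Object number declared at the given byte offset, or -1 if no object header starts there. */
    static int objectNumberAt(byte[] pdf, int offset) {
        if (offset < 0 || offset >= pdf.length) {
            return -1;
        }
        // 32 bytes is plenty for "N G obj"; ISO-8859-1 maps bytes to chars one-to-one.
        int length = Math.min(32, pdf.length - offset);
        String head = new String(pdf, offset, length, StandardCharsets.ISO_8859_1);
        Matcher m = OBJ_HEADER.matcher(head);
        return m.lookingAt() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** True when the xref entry (object number, offset) agrees with what the file contains. */
    static boolean xrefEntryMatches(byte[] pdf, int offset, int expectedNumber) {
        return objectNumberAt(pdf, offset) == expectedNumber;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: XrefObjectCheck <file.pdf> <byte offset> <object number>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        int offset = Integer.parseInt(args[1]);
        int expected = Integer.parseInt(args[2]);
        System.out.println(xrefEntryMatches(pdf, offset, expected) ? "match" : "MISMATCH");
    }
}
```

A reader that wanted to tolerate out-of-order objects could trust the number returned by `objectNumberAt` over the one implied by the xref table.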
iText is not fine.
I still want to see the PDF.
Yep. That PDF is broken. Specifically:
The first 70kb or so of the file define a pretty clean little PDF. Changes were then appended to it.
Check that. Someone attempted to append changes to the PDF and failed. Badly. To understand just how badly, let me explain some of the internal syntax of a PDF, illustrated with this example:

    %PDF-1.6
    1 0 obj
    <</Type/SomeObject ...>>
    endobj
    2 0 obj
    <</Type/SomeOtherObj /Ref 1 0 R>>
    endobj
    3 0 obj
    ...
    endobj
    <etc>
    xref
    0 10
    0000000000 65535 f
    0000000010 00000 n
    0000000049 00000 n
    0000000098 00000 n
    ...
    trailer <</Root 4 0 R /Size 10>>
    startxref
    124
    %%EOF

So we have a header/version ("%PDF-1.x"), a list of objects (the ones shown here are dictionaries), a cross-reference (xref) table listing the byte offsets and object numbers of all the objects in that list, a trailer giving the root object and the number of objects in the PDF, and finally the byte offset of the 'x' in 'xref'.
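As a rough illustration of that last piece, the sketch below reads the number after the final `startxref` and checks that the byte offset really lands on the keyword `xref`. `StartXref`, `lastStartXref`, and `pointsAtXref` are hypothetical names, not iText API, and the check only applies to classic xref tables; files that use cross-reference streams point `startxref` at an object instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class StartXref {
    private static final String KEYWORD = "startxref";
    private static final Pattern NUMBER = Pattern.compile("\\s*(\\d{1,18})");

    /** Offset written after the last "startxref" keyword, or -1 if absent or unparsable. */
    static long lastStartXref(byte[] pdf) {
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keywordAt = text.lastIndexOf(KEYWORD);
        if (keywordAt < 0) {
            return -1;
        }
        Matcher m = NUMBER.matcher(text);
        m.region(keywordAt + KEYWORD.length(), text.length());
        return m.lookingAt() ? Long.parseLong(m.group(1)) : -1;
    }

    /** True when the four bytes at the given offset spell "xref". */
    static boolean pointsAtXref(byte[] pdf, long offset) {
        if (offset < 0 || offset + 4 > pdf.length) {
            return false;
        }
        String word = new String(pdf, (int) offset, 4, StandardCharsets.ISO_8859_1);
        return word.equals("xref");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: StartXref <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = lastStartXref(pdf);
        System.out.println("startxref = " + offset + ", points at xref: " + pointsAtXref(pdf, offset));
    }
}
```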
You can append changes to an existing PDF. To do so, you add any new or changed objects after the existing %%EOF, followed by a cross-reference table for those new objects and a trailer. The trailer of an appended change should include a /Prev key holding the byte offset of the previous cross-reference table.
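A small sketch of how one might pull the /Prev value out of a trailer dictionary follows. It assumes the dictionary has already been extracted as text; `TrailerPrev` and `prevOffset` are hypothetical names, and a regex is a shortcut here, not a real PDF parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class TrailerPrev {
    // "/Prev" followed by whitespace and an integer byte offset.
    private static final Pattern PREV = Pattern.compile("/Prev\\s+(\\d{1,18})");

    /** Byte offset of the previous xref table, or -1 if the trailer has no /Prev key. */
    static long prevOffset(String trailerDict) {
        Matcher m = PREV.matcher(trailerDict);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // A well-formed appended trailer (illustrative values).
        System.out.println(prevOffset("<</Size 12 /Root 4 0 R /Prev 68308>>")); // prints 68308
        // A trailer with no /Prev key.
        System.out.println(prevOffset("<</Info 1 0 R /Root 2 0 R /Size 7>>"));  // prints -1
    }
}
```

A reader would normally follow each /Prev offset to the older xref table and repeat until a trailer has no /Prev, which marks the original revision.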
In your NOT-OKAY PDF, someone tried to append changes to the PDF, AND FAILED BADLY.
The original PDF is still there, intact. That's what Reader shows you, and what you get when you save the PDF. I cut off everything after the first %%EOF in a hex editor, and the file was fine.
So, here is the layout of your NOT-OKAY pdf:

    %PDF1.4.1
    1 0 obj...
    2 through 7
    xref
    0 7
    <healthy xref>
    trailer <</Size 8 /Root 6 0 R /Info 7 0 R>>
    startxref
    68308
    %%EOF

So far so good. Here's where things get ugly:

    <binary garbage>
    endstream
    endobj
    xref
    0 7
    <horribly wrong xref>
    trailer <</ID [...] /Info 1 0 R /Root 2 0 R /Size 7>>
    startxref
    223022
    %%EOF

The only part of that section that looks even remotely correct is the startxref value.
Problems:
- The second trailer has no /Prev key.
- All the byte offsets in the second xref table are wrong.
- That's the tail end of a stream object, but the beginning of that object IS MISSING. Streams should look something like this:

    1 0 obj
    <</Type/SomeType/Length 123>>
    stream
    123 bytes of data
    endstream
    endobj

The end of this file is made up of some portion of a stream (compressed, I'd imagine), but without the dictionary at the beginning telling us which filters it uses and how long it is (to say nothing of any missing data), you can't do anything with it.
I suspect that someone tried to rebuild this PDF entirely, and then accidentally wrote the original 70kb over the start of their version. Kaboom.
It appears that Adobe simply ignores the bad appended changes. iText could do this too, but so can you:
When iText fails to open a PDF:
1. Search backwards through the file, looking for the second-to-last %%EOF. Ignore the one at the very end; we want the previous state of the file.
2. Delete everything after the second-to-last %%EOF (if any) and try opening it again.
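The two steps above can be sketched in plain Java, with no iText involved. `EofTruncate`, `lastIndexOf`, and `cutPoint` are hypothetical names, and the sketch assumes the file is small enough to load into memory. It writes the repaired copy to a second path instead of modifying the original.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of iText.
final class EofTruncate {
    private static final byte[] EOF = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Index of the last occurrence of needle that starts at or before 'from', or -1. */
    static int lastIndexOf(byte[] data, byte[] needle, int from) {
        for (int i = Math.min(from, data.length - needle.length); i >= 0; i--) {
            boolean hit = true;
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    hit = false;
                    break;
                }
            }
            if (hit) {
                return i;
            }
        }
        return -1;
    }

    /** Number of bytes to keep (through the second-to-last %%EOF), or -1 if there are fewer than two. */
    static int cutPoint(byte[] pdf) {
        int last = lastIndexOf(pdf, EOF, pdf.length);
        if (last < 0) {
            return -1;
        }
        int previous = lastIndexOf(pdf, EOF, last - 1);
        return previous < 0 ? -1 : previous + EOF.length;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: EofTruncate <broken.pdf> <output.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        int keep = cutPoint(pdf);
        if (keep < 0) {
            System.err.println("Fewer than two %%EOF markers; nothing to strip.");
            return;
        }
        Files.write(Paths.get(args[1]), Arrays.copyOf(pdf, keep));
    }
}
```

If the truncated copy still fails to open, the same step could be repeated on it to peel off one more appended revision.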
The sad thing is that this broken PDF could have been completely different from the "original" 70kb, and then some I/O error overwrote the first part of the file. Unlikely, but there is no way to be sure.