PDF cross-references

Question

I am developing a parser / writer for PDF, but I am stuck on creating cross-reference streams. My program reads this file, then removes the linearization and decompresses all the objects in the object streams. Finally, it builds a new PDF file and saves it.

This works very well when I use the regular cross-reference table and trailer, as you can see in this file.

When I try to generate a cross-reference stream object instead (which results in this file), Adobe Reader cannot display it.

Does anyone have experience with the PDF format and can help me find what the problem is?

Note that the cross-reference section is the ONLY difference between file 2 and file 3. The first 34127 bytes are identical.

If someone needs the content of the decoded cross-reference stream, download this file and open it in a hex editor. I checked this table again and again, but I could not find anything wrong. The dictionary seems to be fine, too.

Many thanks for your help!!!

Update

I have now completely solved the problem. Here you can find the new PDF.

pdf pdf-generation pdf-parsing

Van coding Dec 29 '10 at 17:30

2 answers

"resultstream.pdf" does not have a valid cross-reference stream.

If I open it in my viewer, it tries to read the "13 0" object as a cross-reference stream, but that object is just a plain dictionary (there are no stream keywords and no data).

A little off topic: what language are you developing in? In Java, at least, I know of three worthwhile options (PDFBox, iText and jPod; I personally prefer jPod, being one of its developers, and it is a very clean implementation :-). If none of them suits your platform, perhaps you can at least take a look at their algorithms and data structures.

EDIT

OK, if "resultstream.pdf" is the document in question, this is what my editor (SciTE) shows:

...
13 0 obj
<</Size 0/W [1 2 0]/Type /XRef/Root 8 0 R>>
endobj
startxref
34127
%%EOF

There is no stream.

mtraut Dec 29 '10 at 18:04

Mark storer · Accepted Answer · 2010-12-30T00:58:06+0000

Two problems that I see (without looking at the stream data):

1. "Size integer (Required) The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary."
Your Size should be ... 14.
2. "Index (Optional) An array containing a pair of integers for each subsection in this section. The first integer shall be the first object number in the subsection; the second integer shall be the number of entries in the subsection. The array shall be sorted in ascending order by object number. Subsections cannot overlap; an object number may have at most one entry in a section. Default value: [0 Size]."
Your Index should probably skip a few numbers. You have no objects 2-4 or 7, and the Index array should reflect this.

Your data is also wrong (and I have only just learned how to read xref streams).

00 00 00 01 00 0a 01 00 47 01 01 01 01 01 70 01 02 fd 01 76 f1 01 84 6b 01 84 a1 01 85 4f

According to this data, which, because you have no /Index, is interpreted as object numbers 0 through 9, the objects have the following offsets:

0 is not used. Fine.
1 is at 0x0a. Yes, correct.
2 is at 0x47. Nope. This is near the start of the "1 0" stream. That is probably not a coincidence.
3 is at 0x101. Nope. 0x101 is still inside the "1 0" stream.
4 is at 0x170. Ditto.
5 is at 0x2fd. Ditto.
6 is at 0x76f1. Nope, and this time it is buried inside the image stream.

I think you get the point. So even if you had a correct /Index, your offsets are all wrong (and completely different from what is in resultNormal.pdf, even allowing for random defragmentation).
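The reading above can be reproduced with a short Python sketch. It splits the decoded bytes into fixed-width entries according to /W and numbers them according to /Index. The function name `parse_xref_entries` is hypothetical, and the fallback used when /Index is absent (start at object 0 and count the entries actually present in the data) is an assumption standing in for the specification's default of [0 Size].

```python
from typing import Iterator, List, Optional, Tuple


def parse_xref_entries(
    data: bytes, w: List[int], index: Optional[List[int]] = None
) -> Iterator[Tuple[int, ...]]:
    """Yield (object_number, type, field2, field3) for each entry in the data."""
    entry_len = sum(w)
    if entry_len == 0 or len(data) % entry_len:
        raise ValueError("data length is not a multiple of the entry length")
    if index is None:
        # Assumption: without /Index, number the entries from object 0 upward.
        index = [0, len(data) // entry_len]
    pos = 0
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            if pos + entry_len > len(data):
                raise ValueError("/Index describes more entries than the data holds")
            fields = []
            for i, width in enumerate(w):
                if width == 0:
                    # A zero-width field is absent: type defaults to 1, others to 0.
                    fields.append(1 if i == 0 else 0)
                else:
                    fields.append(int.from_bytes(data[pos:pos + width], "big"))
                pos += width
            yield (obj_num, *fields)


# The decoded stream bytes quoted in this answer, read with /W [1 2 0].
DATA = bytes.fromhex(
    "00 00 00 01 00 0a 01 00 47 01 01 01 01 01 70 "
    "01 02 fd 01 76 f1 01 84 6b 01 84 a1 01 85 4f"
)

for obj_num, kind, offset, gen in parse_xref_entries(DATA, [1, 2, 0]):
    print(f"{obj_num}: type={kind} offset=0x{offset:x} gen={gen}")
```

Passing an explicit `index` list changes only the object numbers assigned to the entries, not the offsets, which is why a missing /Index makes every entry after the first gap point at the wrong object.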

What you want is what can be found in the resultNormal.pdf xref:

xref
0 2
0000000000 65535 f
0000000010 00000 n
5 2
0000003460 00000 n
0000003514 00000 n
8 5
0000003688 00000 n
0000003749 00000 n
0000003935 00000 n
0000004046 00000 n
0000004443 00000 n

So your index should be (if I am reading this right): /Index [0 2 5 2 8 5]. And the data:
0 0 0
1 0 a
1 3460 (this is a decimal number)
1 3514 (same)
1 3688
etc.
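As a rough Python sketch of that construction: the /Index array can be derived by collapsing the object numbers into consecutive runs, and the entry data by packing a type byte and a two-byte big-endian offset per object, matching /W [1 2 0]. The helper names `build_index` and `encode_entries` are hypothetical, the offsets are copied from the regular xref table quoted in this answer, and treating object 0 as the only free entry is an assumption.

```python
from typing import Dict, List


def build_index(object_numbers: List[int]) -> List[int]:
    """Collapse object numbers into [first count first count ...] pairs."""
    index: List[int] = []
    for num in sorted(set(object_numbers)):
        if index and num == index[-2] + index[-1]:
            index[-1] += 1      # num extends the current subsection
        else:
            index += [num, 1]   # num starts a new subsection
    return index


def encode_entries(offsets: Dict[int, int], offset_width: int = 2) -> bytes:
    """Pack entries for /W [1 offset_width 0] in ascending object order."""
    out = bytearray()
    for num in sorted(offsets):
        # Assumption: object 0 is the free entry (type 0); the rest are type 1.
        out.append(0 if num == 0 else 1)
        out += offsets[num].to_bytes(offset_width, "big")
    return bytes(out)


# Decimal offsets taken from the regular xref table in resultNormal.pdf.
OFFSETS = {0: 0, 1: 10, 5: 3460, 6: 3514, 8: 3688,
           9: 3749, 10: 3935, 11: 4046, 12: 4443}
XREF_STREAM_OBJ = 13  # the cross-reference stream itself is object "13 0"

index = build_index(list(OFFSETS))
size = max(max(OFFSETS), XREF_STREAM_OBJ) + 1  # one greater than the highest object number
data = encode_entries(OFFSETS)

print("/Index", index)
print("/Size", size)
print(data.hex(" ", 3))
```

Note that `to_bytes` raises `OverflowError` if an offset does not fit in the chosen width, so a larger file would need a wider second field in /W.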

Interestingly, the PDF specification says that Size should be BOTH the number of entries in this and all previous xref sections AND one greater than the highest object number in use.

I don't think the latter part is ever enforced, but I would not be surprised to find that xref streams are stricter than regular cross-reference tables. Maybe the same code handles both, maybe not.

@mtraut:

Here is what I see:

13 0 obj
<</Size 10/Length 44/Filter /FlateDecode/DecodeParms <</Columns 3/Predictor 12>>/W [1 2 0]/Type /XRef/Root 8 0 R>>
stream
...
endstream
endobj
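To inspect the entries of a stream declared like this one, the data has to be inflated and the PNG predictor undone before the /W fields can be read. A minimal Python sketch, assuming only the None and Up row filters occur (Predictor 12 is PNG Up); the function name `decode_xref_stream` is hypothetical:

```python
import zlib


def decode_xref_stream(raw: bytes, columns: int) -> bytes:
    """Inflate the stream data and undo PNG row filters (None and Up only)."""
    inflated = zlib.decompress(raw)
    row_len = columns + 1                  # a filter-type byte precedes each row
    if len(inflated) % row_len:
        raise ValueError("inflated data is not a whole number of rows")
    out = bytearray()
    prev = bytes(columns)                  # the row above the first row is all zeros
    for start in range(0, len(inflated), row_len):
        filter_type = inflated[start]
        row = inflated[start + 1:start + row_len]
        if filter_type == 0:               # None: the row is stored as-is
            cur = bytes(row)
        elif filter_type == 2:             # Up: stored as the difference from the row above
            cur = bytes((b + p) & 0xFF for b, p in zip(row, prev))
        else:
            raise NotImplementedError(
                f"PNG filter type {filter_type} is not handled in this sketch"
            )
        out += cur
        prev = cur
    return bytes(out)
```

With /Columns 3, each decoded row is exactly one /W [1 2 0] entry, so the returned bytes can be split into three-byte entries directly.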
