What is the ID field in a PDF file?

Question

I am working on improving the PDF scrubber in the ApprovalTests framework and am looking at a simple PDF file created with PDFsharp. Its content looks like this.

Does anyone know what the ID field at the bottom is?

%PDF-1.4 %ÓôÌá 1 0 obj << /CreationDate(D:20131119194420-06'00') /Creator(PDFsharp 1.32.3057-g \(www.pdfsharp.net\)) /Producer(PDFsharp 1.32.3057-g \(www.pdfsharp.net\)) >> endobj 2 0 obj << /Type/Catalog /Pages 3 0 R >> endobj 3 0 obj << /Type/Pages /Count 1 /Kids[4 0 R] >> endobj 4 0 obj << /Type/Page /MediaBox[0 0 612 792] /Parent 3 0 R /Contents 5 0 R /Resources << /ProcSet [/PDF/Text/ImageB/ImageC/ImageI] /ExtGState << /GS0 6 0 R >> /Font << /F0 8 0 R >> >> /Group << /CS/DeviceRGB /S/Transparency /I false /K false >> >> endobj 5 0 obj << /Length 99 /Filter/FlateDecode >> stream xœŠI €@ïyE¼)¸ÄŒ^—«ðŽ 2"êÍ×)ènšº ER¢¿ÊŠq>t¡¼pA-t#áö@ÒªÄú¯À†ã¢R7#ç(ý~qîq:og½ endstream endobj 6 0 obj << /Type/ExtGState /ca 1 >> endobj 7 0 obj << /Type/FontDescriptor /Ascent 1005 /CapHeight 727 /Descent -210 /Flags 32 /FontBBox[-550 -303 1707 1072] /ItalicAngle 0 /StemV 0 /XHeight 548 /FontName/Verdana,Bold >> endobj 8 0 obj << /Type/Font /Subtype/TrueType /BaseFont/Verdana,Bold /Encoding/WinAnsiEncoding /FontDescriptor 7 0 R /FirstChar 0 /LastChar 255 /Widths[1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 341 402 587 867 710 1271 862 332 543 543 710 867 361 479 361 689 710 710 710 710 710 710 710 710 710 710 402 402 867 867 867 616 963 776 761 723 830 683 650 811 837 545 555 770 637 947 846 850 732 850 782 710 681 812 763 1128 763 736 691 543 689 543 867 710 710 667 699 588 699 664 422 699 712 341 402 670 341 1058 712 686 699 699 497 593 455 712 649 979 668 650 596 710 543 710 867 1000 710 1000 332 710 587 1048 710 710 710 1777 710 543 1135 1000 691 1000 1000 332 332 587 587 710 710 1000 710 963 593 543 1067 1000 596 736 341 402 710 710 710 710 543 710 710 963 597 849 867 479 963 710 587 867 597 597 710 721 710 361 710 597 597 849 1181 1181 1181 616 776 776 776 776 776 776 1093 723 683 683 683 683 545 545 545 545 830 846 850 850 850 850 850 867 850 812 812 812 812 736 734 
712 667 667 667 667 667 667 1018 588 664 664 664 664 341 341 341 341 679 712 686 686 686 686 686 867 686 712 712 712 712 650 699 650] >> endobj xref 0 9 0000000000 65535 f 0000000015 00000 n 0000000180 00000 n 0000000228 00000 n 0000000283 00000 n 0000000538 00000 n 0000000707 00000 n 0000000750 00000 n 0000000935 00000 n trailer << /ID[<48189AA5E6D2394D8EF6E7842493B4A9><48189AA5E6D2394D8EF6E7842493B4A9>] /Info 1 0 R /Root 2 0 R /Size 9 >> startxref 2167 %%EOF

+6

pdf

George mauer Nov 20 '13 at 1:53

3 answers

According to this article:

 4. Append the file identifier (the /ID entry from the trailer dictionary). This is an arbitrary string of bytes; Adobe recommends that it be generated by MD5 hashing various pieces of information about the document.

That passage is about encrypting PDF files. According to the same article, an identifier is required only when the file is encrypted:

 a program that makes PDF files is only required to create the file identifier if the file is to be encrypted.

This SO question also contains some good information. It states that the identifier should be sufficiently unique and cites a specific ISO standard number to look up for more details.

+1

Millie smith Nov 20 '13 at 2:18

It appears that the trailer ID is mandatory in the PDF/A archival standard (ISO 19005), so this may be a consideration for some PDF generators.

0

Jeff epler Apr 30 '15 at 16:25

mkl · Accepted Answer · 2013-11-20T08:51:12+0000

Some notes to add to the picture from @Millie's answer:

If you are in doubt about some aspect of PDF, the specification, ISO 32000-1, should be the first place to look.

It specifies the ID entry as:

ID (required if an Encrypt entry is present; optional otherwise; PDF 1.1)
An array of two byte strings constituting a file identifier (see 14.4, "File Identifiers") for the file. If there is an Encrypt entry, this array and the two byte strings shall be direct objects and shall be unencrypted.
NOTE 1 Because the ID entries are not encrypted, it is possible to check the ID key to make sure that the correct file is being accessed without decrypting the file. The restrictions that the strings be direct objects and not be encrypted ensure that this is possible.
NOTE 2 Although this entry is optional, its absence might prevent the file from functioning in some workflows that depend on files being uniquely identified.
NOTE 3 The values of the ID strings are used as input to the encryption algorithm. If these strings were indirect, or if the ID array were indirect, these strings would be encrypted when written. This would result in a circular condition for a reader: the ID strings must be decrypted in order to use them to decrypt strings, including the ID strings themselves. The preceding restriction prevents this circular condition.
(Table 15 - Entries in the File Trailer Dictionary)
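
To make the table entry concrete, here is a minimal sketch that pulls the two ID strings out of a trailer and applies the "required if Encrypt is present" rule. The function names `read_file_id` and `check_trailer` are made up for this illustration, and the regular expression only handles a direct `/ID` array written as two hexadecimal strings, as in the question's file. A real tool should use a proper PDF parser.

```python
import re

# Trailer copied from the PDF dump in the question.
TRAILER = (
    "trailer << /ID[<48189AA5E6D2394D8EF6E7842493B4A9>"
    "<48189AA5E6D2394D8EF6E7842493B4A9>] /Info 1 0 R /Root 2 0 R /Size 9 >>"
)

# Assumption: the ID is a direct array of two hex strings.
# Literal strings "( ... )" and indirect references are not handled here.
ID_PATTERN = re.compile(
    r"/ID\s*\[\s*<([0-9A-Fa-f\s]*)>\s*<([0-9A-Fa-f\s]*)>\s*\]"
)


def _decode_hex(hex_text):
    """Decode a PDF hex string body; a missing final digit is read as 0."""
    digits = re.sub(r"\s+", "", hex_text)
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


def read_file_id(trailer_text):
    """Return (first, second) ID byte strings, or None if no /ID is found."""
    match = ID_PATTERN.search(trailer_text)
    if match is None:
        return None
    return _decode_hex(match.group(1)), _decode_hex(match.group(2))


def check_trailer(trailer_text):
    """Return the file ID (or None), rejecting Encrypt without ID."""
    has_encrypt = re.search(r"/Encrypt\b", trailer_text) is not None
    file_id = read_file_id(trailer_text)
    if has_encrypt and file_id is None:
        raise ValueError("trailer has /Encrypt but no /ID entry")
    return file_id


if __name__ == "__main__":
    first, second = check_trailer(TRAILER)
    print(first.hex().upper(), second.hex().upper())
```

Returning `None` rather than raising for an unencrypted file without an ID reflects that the entry is optional in that case.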

NOTE 2 above is essentially a recommendation to include this optional value, even though it is not phrased in the SHALL/SHOULD/MAY specification language used elsewhere in the document.

The recommendation is made more explicit in section 14.4 of the specification:

The ID is optional, but should be used.

Since "should" in such specifications denotes a recommendation, and a recommendation is defined as something to be done unless there is a good reason not to, a PDF writer ought to create this entry unless it can argue against the requirement (and I can hardly come up with arguments against it). This should answer the question asked in a comment on Millie's answer:

any idea why both pdfsharp and phantomjs create it?

It is not merely considered good practice, as another comment there suggested; it is a recommendation of the specification.

Regarding the contents of the ID array, the specification continues in section 14.4:

The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file's contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found.
To help ensure the uniqueness of file identifiers, they should be computed by means of a message digest algorithm ...
The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be unique.
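
The rules quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are invented, and the inputs fed to the digest (current time, file path, file size, Info values, plus some random bytes) are example choices rather than a definitive recipe. MD5 is used because the article quoted in the first answer mentions it.

```python
import hashlib
import os
import time


def new_identifier(file_path, file_size, info_values):
    """Build a 16-byte identifier that is very likely to be unique."""
    digest = hashlib.md5()
    digest.update(str(time.time()).encode("utf-8"))
    digest.update(file_path.encode("utf-8"))
    digest.update(str(file_size).encode("utf-8"))
    for value in info_values:
        digest.update(value.encode("utf-8"))
    # Assumption: extra random bytes, since reproducibility is not needed.
    digest.update(os.urandom(16))
    return digest.digest()


def initial_id(file_path, file_size, info_values):
    """First write: both identifiers get the same value."""
    ident = new_identifier(file_path, file_size, info_values)
    return (ident, ident)


def updated_id(current, file_path, file_size, info_values):
    """Incremental update: keep the first string, replace the second."""
    return (current[0], new_identifier(file_path, file_size, info_values))


def compare_ids(expected, found):
    """Classify a candidate file by comparing ID pairs."""
    if found is None:
        return "no identifier"  # consumers must tolerate a missing ID
    if expected == found:
        return "same file, unchanged"
    if expected[0] == found[0]:
        return "different version of the same file"
    return "different file"


def format_id(pair):
    """Render the pair as a trailer entry with hex strings."""
    return "/ID[<%s><%s>]" % (pair[0].hex().upper(), pair[1].hex().upper())


if __name__ == "__main__":
    created = initial_id("report.pdf", 2167, ["PDFsharp"])
    revised = updated_id(created, "report.pdf", 2300, ["PDFsharp"])
    print(format_id(created))
    print(compare_ids(created, revised))
```

Because the identifier is not reproducible, two runs that generate the same document will normally produce different ID values, which is presumably why a test scrubber needs to normalise this field.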

So the other article quoted in Millie's answer is not entirely correct in saying

a program that creates PDF files is only required to create a file identifier if the file is to be encrypted.

Even in the absence of encryption, such a program needs a good reason not to create a file identifier, because the specification recommends one. Thus, in the absence of such a reason, creating a file identifier is effectively required.

All that said, any PDF consumer should always be prepared to encounter a PDF without a file identifier ... the producer may have had a reason not to create one.
