Removing the PdfID from a PDF

I'm using iText to convert XHTML to PDF. After that, I compute the MD5 checksum of the created PDF so that I store only new or changed files.

Each created file contains PdfID0 and PdfID1, which look like hashes.

What are these "hashes" for, and how can I remove them?

I'm using the following code from the iText package to change the meta information:

    com.lowagie.text.pdf.PdfReader reader = new PdfReader(pdfPath);
    com.lowagie.text.pdf.PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(tempFile));
    HashMap<String, String> hMap = reader.getInfo();
    hMap.put("Title", "MyTitle");
    hMap.put("Subject", "Subject");
    hMap.put("Keywords", "Key, words, here");
    hMap.put("Creator", "me");
    hMap.put("Author", "me");
    hMap.put("Producer", "me");
    hMap.put("CreationDate", null);
    hMap.put("ModDate", null);
    hMap.put("DocChecksum", null);
    stamper.setMoreInfo(hMap);
    stamper.close();

and extracted the file's metadata using pdftk:

    InfoKey: Creator
    InfoValue: me
    InfoKey: Title
    InfoValue: MyTitle
    InfoKey: Author
    InfoValue: me
    InfoKey: Producer
    InfoValue: me
    InfoKey: Keywords
    InfoValue: Key, words, here
    InfoKey: Subject
    InfoValue: Subject
    PdfID0: 28c71a8d7790a4d3e85ce879a90dec0
    PdfID1: 4c5865d36c7a381e6166d5e362d0aafc
    NumberOfPages: 1

Thanks for any tips.

2 answers

Regarding the identifiers, the PDF specification says:

"File identifiers shall be defined by the optional ID entry in a PDF file's trailer dictionary (see 7.5.5, "File Trailer"). The ID entry is optional but should be used. The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file's contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found."

This means identifiers are optional but recommended.
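The matching rule in the quoted passage can be sketched in Java roughly as follows. The class, enum, and method names are hypothetical, the identifiers are assumed to be available as hex strings (as pdftk prints them), and treating a mismatch of the first identifier as "a different file" is an assumption that the quoted text does not spell out.

```java
final class FileIdCheck {

    /** Possible outcomes when a stored file reference is resolved (hypothetical names). */
    enum Match { SAME_FILE, OTHER_VERSION, DIFFERENT_FILE }

    /**
     * Compares an expected /ID pair with the pair found in a candidate file.
     * id0 is the permanent identifier, id1 the changing one; hex case is ignored.
     */
    static Match compare(String expectedId0, String expectedId1,
                         String actualId0, String actualId1) {
        if (!expectedId0.equalsIgnoreCase(actualId0)) {
            // Assumption: if the permanent identifier differs, it is not the same document.
            return Match.DIFFERENT_FILE;
        }
        // Permanent identifier matches; the changing one tells us whether it was updated.
        return expectedId1.equalsIgnoreCase(actualId1)
                ? Match.SAME_FILE
                : Match.OTHER_VERSION;
    }
}
```

For example, `FileIdCheck.compare(id0, id1, id0, someOtherId)` yields `OTHER_VERSION`, which corresponds to "a different version of the correct file".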

iText automatically inserts and updates the identifiers. You can, of course, modify iText (it is open source, after all) so that it does not.


What pdftk reports as PdfID0 and PdfID1 in its metadata output comes from the trailer at the end of the PDF file, which looks like this (example):

    trailer
    <<
      /Size 32
      /Root 24 R
      /Info 19 R
      /ID [
        <28c71a8d7790a4d3e85ce879a90dec0>
        <4c5865d36c7a381e6166d5e362d0aafc>
      ]
    >>
    startxref
    81799
    %%EOF

The /ID entry in the trailer dictionary is required only if an /Encrypt entry is present; otherwise it is an optional key.

The PDF specification describes it as:

"An array of two byte-strings constituting a file identifier (see 14.4, "File Identifiers") for the file. If there is an Encrypt entry, this array and the two byte-strings shall be direct objects and shall be unencrypted."

and furthermore:

"The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file's contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found."

And it is not necessarily a hash. Here is what the ISO PDF specification suggests (it does not prescribe it):

"To help ensure the uniqueness of file identifiers, they should be computed by means of a message digest algorithm such as MD5 (described in Internet RFC 1321, The MD5 Message-Digest Algorithm; see the Bibliography), using the following information:

  • The current time
  • A string representation of the file's location, usually a pathname
  • The size of the file in bytes
  • The values of all entries in the file's document information dictionary (see 14.3.3, "Document Information Dictionary")"
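To make the list above concrete, here is a minimal sketch of how such an identifier could be computed with the JDK's `MessageDigest`. It is not how iText or any other library actually does it: the class and method names are hypothetical, and the byte encoding and the order in which the inputs are fed to the digest are assumptions, since the specification leaves them open.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class FileIdSketch {

    /** Hashes time, location, size and info-dictionary values into a 32-digit hex MD5. */
    static String computeId(long timeMillis, String path, long sizeBytes,
                            Map<String, String> info) {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        md5.update(Long.toString(timeMillis).getBytes(StandardCharsets.UTF_8));
        md5.update(path.getBytes(StandardCharsets.UTF_8));
        md5.update(Long.toString(sizeBytes).getBytes(StandardCharsets.UTF_8));
        // Sort by key so the result does not depend on the map's iteration order.
        for (String value : new TreeMap<>(info).values()) {
            if (value != null) {
                md5.update(value.getBytes(StandardCharsets.UTF_8));
            }
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Because the current time is one of the inputs, two runs over identical source documents produce different identifiers, which is why the PdfID values differ between otherwise identical PDFs.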

A few spots in a generated PDF file may change with each new run. These keys in the document information dictionary (the /Info object referenced from the trailer):

  • /CreationDate
  • /ModDate

can be updated every time you create or modify a PDF.

Therefore, using your own MD5 checksum over the generated PDF to detect new or changed files will not work unless you at least "normalize" /CreationDate and /ModDate, as well as /ID, before computing the MD5 hash.
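One way to do that normalization, as a rough sketch using only the JDK: blank out the /ID array and the two date strings with regular expressions, then hash what is left. The class and method names are hypothetical. The sketch assumes an unencrypted PDF whose /Info dictionary and trailer are stored as plain text (not inside compressed object streams) with dates written as literal `(...)` strings, and it does not touch any dates in embedded XML metadata. The normalized bytes are meant for hashing only; they are no longer a valid PDF, because the byte offsets in the cross-reference table no longer match.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class NormalizedPdfHash {

    /** Replaces the /ID array and the /CreationDate and /ModDate values with empty ones. */
    static String normalize(String pdfText) {
        return pdfText
                .replaceAll("/ID\\s*\\[\\s*<[0-9A-Fa-f]*>\\s*<[0-9A-Fa-f]*>\\s*\\]", "/ID [<> <>]")
                .replaceAll("/(CreationDate|ModDate)\\s*\\([^)]*\\)", "/$1 ()");
    }

    /** Returns the hex MD5 of the PDF bytes after normalization. */
    static String md5Hex(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the round trip.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        byte[] normalized = normalize(text).getBytes(StandardCharsets.ISO_8859_1);
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md5.digest(normalized)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

A caller would read the file with `java.nio.file.Files.readAllBytes(...)` and compare `NormalizedPdfHash.md5Hex(bytes)` against the stored checksum instead of hashing the raw file.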


Update: As user mkl correctly noted in a comment on this answer, the /CreationDate and /ModDate entries of the /Info dictionary (as well as the /ID information) usually have equivalents in the XML metadata embedded in the PDF. You can display the full XML metadata with the pdfinfo utility, for example:

 pdfinfo -meta your.pdf 

Source: https://habr.com/ru/post/1443631/

