Why are PDF files different from each other, even if the content is the same?

Question

“There's usually more than one way to create PDF documents that look like identical twins when opened in a PDF viewer. And even if you create two identical PDF documents using the same code, there will be slight differences between the two resulting files. This is inherent to the PDF format.”

I read this paragraph in "iText in Action, Second Edition" (p. 17). Could someone please explain what differences the author is talking about, and why the PDF format has this defect, if I may call it that?

+4

pdf pdf-generation

programer8 Nov 18 '13 at 3:40

2 answers

In addition to the other answers, don't forget that there are always different ways to achieve the same result in programming. Think about when HTML5 hit the scene:

 <script> alert("Hey"); </script>

compared to the older way of including JS:

 <SCRIPT type="text/javascript"> alert("Hey"); </script>

It's simply that there are always different ways to produce the same effect, and two different people will use two different methods. This is why the REST API was created.

0

morantis Nov 18 '13 at 15:07

Bruno Lowagie · Accepted Answer · 2013-11-18T08:18:38+0000

Files created at different moments in time have a different value for CreationDate, and they have different file identifiers (two files created at different moments must have a different ID, as defined in the PDF specification).

A file identifier is usually a hash based on the date, the path name, the file size, and part of the contents of the PDF file (for example, the entries in the information dictionary). I quote ISO-32000-1:

The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be unique. For example, two implementations of the preceding algorithm might use different formats for the current time, causing them to produce different file identifiers for the same file created at the same time, but the uniqueness of the identifier is not affected.
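To make the idea concrete, here is a minimal Python sketch of how a writer might compute such an identifier. It is only an illustration under assumptions: the function name `make_file_id`, the choice of MD5, the time format, and the sample path and info entries are all hypothetical, and real PDF libraries may do this differently.

```python
import hashlib
from datetime import datetime, timezone


def make_file_id(path, size_bytes, info, now=None):
    """Return a 16-byte identifier hashed from time, path, size and info entries.

    Hypothetical helper for illustration; not taken from any PDF library.
    """
    now = now or datetime.now(timezone.utc)
    digest = hashlib.md5()
    digest.update(now.isoformat().encode("utf-8"))   # current time (format is arbitrary)
    digest.update(path.encode("utf-8"))              # file location
    digest.update(str(size_bytes).encode("utf-8"))   # file size in bytes
    for value in info.values():                      # information dictionary values
        digest.update(str(value).encode("utf-8"))
    return digest.digest()


# Same path, size and info entries; only the creation time differs by one second.
info = {"Title": "Hello", "Producer": "example"}
first = make_file_id("/tmp/hello.pdf", 1024, info,
                     datetime(2013, 11, 18, 8, 0, 0, tzinfo=timezone.utc))
second = make_file_id("/tmp/hello.pdf", 1024, info,
                      datetime(2013, 11, 18, 8, 0, 1, tzinfo=timezone.utc))
print(first.hex())
print(second.hex())
```

The two printed identifiers differ even though everything except the timestamp is identical, which is exactly why two runs of the same code give different files.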

File identifiers are required when a document is encrypted, because they are used in the encryption process. As a result, encrypted PDF files with different file identifiers will have completely different streams. This is not a flaw; it is by design. I am a member of the ISO committee that is working on the PDF 2.0 specification, and I can assure you that there are no plans to change this. Files created at different moments in time will differ, even when the same code is used. (I am also the author of the book you are referring to.)
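The effect on encrypted streams can be shown with a toy example. Note that this is not the algorithm of the PDF standard security handler: `toy_encrypt`, the SHA-256 keystream and the sample identifiers are assumptions made up for illustration. The only point is that once the file identifier feeds into the key, identical content produces different bytes.

```python
import hashlib


def toy_encrypt(data, password, file_id):
    """XOR data with a keystream derived from the password and the file identifier.

    Toy stand-in for illustration only; NOT the PDF encryption algorithm.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        # Each keystream block depends on password, file identifier and a counter.
        stream += hashlib.sha256(password + file_id + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


content = b"BT /F1 12 Tf (Hello World) Tj ET"
cipher_a = toy_encrypt(content, b"secret", b"id-of-first-file")
cipher_b = toy_encrypt(content, b"secret", b"id-of-second-file")
print(cipher_a != cipher_b)
```

Applying `toy_encrypt` again with the same password and identifier recovers the original content, since XOR is its own inverse.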

The ISO specification also allows for other differences. For example, the syntax used to draw graphics and text on a page may be rearranged for any reason. See section 8.2 of ISO-32000-1:

The important point is that there is no semantic significance to the exact arrangement of graphics state operators. A conforming reader or writer of a PDF content stream may change an arrangement of graphics state operators to any other arrangement that achieves the same values of the relevant graphics state parameters for each graphics object.

When processing a PDF content stream, a PDF processor may change the arrangement of graphics state operators to any other arrangement that achieves the same values of the relevant graphics state parameters for each graphics object. This can be done to optimize the page, to make it faster, to simplify debugging, to improve compression, or for any other reason.
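A small sketch may help here. The Python below is a deliberately tiny interpreter, written only for illustration, that understands a handful of real content stream operators (`w` for line width, `RG` for stroke colour, `m` and `l` for path construction, `S` for stroking). The function name `states_at_paint` and the two sample streams are assumptions, and a real PDF parser handles far more than this.

```python
def states_at_paint(content):
    """Return the graphics state in effect at each stroke operator (S)."""
    # Defaults: line width 1.0 and black stroke colour.
    state = {"line_width": 1.0, "stroke_rgb": (0.0, 0.0, 0.0)}
    operands, snapshots = [], []
    for token in content.split():
        if token == "w":                      # set line width
            state["line_width"] = float(operands[-1])
            operands.clear()
        elif token == "RG":                   # set stroke colour (RGB)
            state["stroke_rgb"] = tuple(float(x) for x in operands[-3:])
            operands.clear()
        elif token == "S":                    # stroke the path: record the state
            snapshots.append(dict(state))
            operands.clear()
        elif token in ("m", "l"):             # path construction does not touch state
            operands.clear()
        else:
            operands.append(token)            # numeric operand for the next operator
    return snapshots


# Same line, same width, same colour; only the operator order differs.
stream_a = "2 w 1 0 0 RG 10 10 m 100 10 l S"
stream_b = "1 0 0 RG 2 w 10 10 m 100 10 l S"
print(stream_a == stream_b)
print(states_at_paint(stream_a) == states_at_paint(stream_b))
```

The two streams are different byte sequences, yet the graphics state seen by the stroked path is the same in both, so a viewer draws the same red line.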

Another reason why two seemingly identical PDF files may differ internally concerns PDF dictionaries. The order of the keys in a dictionary is irrelevant in PDF. Software that implements the specification to write PDF will, for example, use a HashMap for the key/value pairs. Depending on the JVM, the same code can produce two PDF files whose dictionaries are semantically identical but whose entries are ordered differently. This is not a bug; it is fully compliant with ISO-32000-1.
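The same effect can be sketched in a few lines of Python. Unlike a Java HashMap, a Python dict keeps insertion order, so the reordering is simulated here by inserting the same entries in a different order. `write_dict` and the sample page entries are hypothetical and only meant as an illustration.

```python
def write_dict(entries):
    """Serialise a mapping as a PDF-style dictionary, in the mapping's iteration order."""
    body = " ".join(f"/{key} {value}" for key, value in entries.items())
    return f"<< {body} >>".encode("ascii")


# The same three entries, inserted in a different order.
page_a = {"Type": "/Page", "Rotate": "0", "MediaBox": "[0 0 595 842]"}
page_b = {"MediaBox": "[0 0 595 842]", "Type": "/Page", "Rotate": "0"}

bytes_a = write_dict(page_a)
bytes_b = write_dict(page_b)
print(page_a == page_b)    # the dictionaries carry the same entries
print(bytes_a == bytes_b)  # the serialised bytes are not the same
```

A byte-for-byte comparison of the two outputs fails, while a comparison at the dictionary level succeeds, which is the distinction the specification relies on.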

Important: the internal differences between two PDF files created with the same code, but at different moments in time, should not cause any visual difference when the document is opened in a PDF viewer or printed on paper.
