wkhtmltopdf generates a different checksum on every run

I am trying to verify that the content generated by wkhtmltopdf is the same from run to run. However, every time I run wkhtmltopdf I get a different hash / checksum value for the same page. I am talking about something really basic, like this HTML page:

    <html>
      <body>
        <p>This is some text</p>
      </body>
    </html>

I get different md5 or sha256 hashes every time I run wkhtmltopdf from the command line:

    ./wkhtmltopdf example.html ~/Documents/a.pdf

And using this Python hashing code:

    import hashlib

    def shasum(filename):
        sha = hashlib.sha256()
        with open(filename, 'rb') as f:
            # Read in chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: f.read(128 * sha.block_size), b''):
                sha.update(chunk)
        return sha.hexdigest()

or the md5 version, which just swaps sha256 for md5.

Why does wkhtmltopdf generate a file that is different enough to produce a different checksum, and is there any way to avoid this? Is there a command-line option that can be passed to prevent it?

I tried --default-header, --no-pdf-compression and --disable-smart-shrinking.

This is on Mac OS X, but I have also generated these PDF files on other machines and downloaded them, with the same result.

wkhtmltopdf version = 0.10.0 rc2

4 answers

I tried this and opened the resulting PDF file in emacs. wkhtmltopdf inserts a "/CreationDate" field into the PDF. Its value is different on each run, so the hash values will not match between runs.
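
The same check can be done without an editor by comparing two outputs line by line. This is a minimal Python sketch, not a definitive tool: the function name `differing_lines` and the sample bytes are made up for illustration, and real files would be read with `open(path, "rb")`.

```python
def differing_lines(a: bytes, b: bytes):
    """Return (line_number, line_a, line_b) for each line that differs.

    Only the lines both inputs have in common are compared, because zip()
    stops at the shorter input.
    """
    lines_a = a.split(b"\n")
    lines_b = b.split(b"\n")
    return [
        (n, la, lb)
        for n, (la, lb) in enumerate(zip(lines_a, lines_b), start=1)
        if la != lb
    ]


# Hypothetical stand-ins for the output of two separate runs.
run1 = b"%PDF-1.4\n/Title (example)\n/CreationDate (D:20130101120000)\n%%EOF\n"
run2 = b"%PDF-1.4\n/Title (example)\n/CreationDate (D:20130101120507)\n%%EOF\n"

for number, first, second in differing_lines(run1, run2):
    print(number, first, second)
```

With these sample inputs, only the line holding the creation date is reported.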

I did not see an option to disable the "/CreationDate" field, but it would be simple to strip it from the file contents before calculating the hash.
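
A minimal sketch of that idea in Python, assuming the field looks like `/CreationDate (D:...)` and that nothing else in the file varies between runs. The names `CREATION_DATE` and `normalized_sha256` are hypothetical, and the sample bytes only stand in for real PDFs.

```python
import hashlib
import re

# Assumed layout of the field: "/CreationDate (D:<date string>)".
CREATION_DATE = re.compile(rb"/CreationDate\s*\(D:[^)]*\)")


def normalized_sha256(pdf_bytes: bytes) -> str:
    """Hash the PDF bytes after blanking the first /CreationDate value."""
    stripped = CREATION_DATE.sub(b"/CreationDate ()", pdf_bytes, count=1)
    return hashlib.sha256(stripped).hexdigest()


# Hypothetical stand-ins for two runs over the same page.
run1 = b"%PDF-1.4\n/CreationDate (D:20130101120000)\n%%EOF\n"
run2 = b"%PDF-1.4\n/CreationDate (D:20130101120507)\n%%EOF\n"

print(normalized_sha256(run1) == normalized_sha256(run2))
```

The stripped bytes are used only for hashing. The file on disk is left alone, since removing bytes from a PDF shifts the byte offsets recorded in its cross-reference table.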


I wrote a method to copy the creation date from the expected output into the newly generated file. This is in Ruby, and the arguments can be any objects that walk and quack like IO:

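    def copy_wkhtmltopdf_creation_date(to, from)
      # Remember the current positions so they can be restored afterwards.
      to_current_pos, from_current_pos = [to.pos, from.pos]
      # The 14-digit date starts at byte offset 74 in these files.
      to.pos = from.pos = 74
      to.write(from.read(14))
      to.pos, from.pos = [to_current_pos, from_current_pos]
    end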

I was inspired by Carlos to write a solution that does not hard-code the offset, since in my documents the index was different from Carlos's 74.

In addition, I do not have the files open already, so the method takes file paths. It also stops scanning as soon as CreationDate is found and does nothing when it is not found.

    def copy_wkhtmltopdf_creation_date(to, from)
      # Walk the source file line by line, accumulating the offset of each line.
      index, date = File.foreach(from).reduce(0) do |acc, line|
        if line.index("CreationDate")
          # Stop at the first match: offset of the 14 digits and the digits themselves.
          break [acc + line.index(/\d{14}/), $~[0]]
        else
          acc + line.bytesize
        end
      end

      if date # IE, yes this is a wkhtmltopdf document
        File.open(to, "r+") do |to|
          to.pos = index
          to.write(date)
        end
      end
    end
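
For readers working in Python, as in the question, the same approach of locating the date rather than hard-coding its offset might look like the sketch below. It is an illustration under assumptions, not a port of the Ruby above: `DATE` and `copy_creation_date` are hypothetical names, and the date is assumed to appear as `/CreationDate (D:` followed by 14 digits.

```python
import re

# Assumption: the date appears as "/CreationDate (D:" followed by 14 digits.
DATE = re.compile(rb"/CreationDate\s*\(D:(\d{14})")


def copy_creation_date(generated: bytes, expected: bytes) -> bytes:
    """Return `generated` with its 14-digit creation date taken from `expected`."""
    src = DATE.search(expected)
    dst = DATE.search(generated)
    if src is None or dst is None:
        # No creation date found in one of the inputs; leave the bytes untouched.
        return generated
    start, end = dst.span(1)
    return generated[:start] + src.group(1) + generated[end:]


# Hypothetical stand-ins for the expected output and a fresh run.
expected = b"%PDF-1.4\n/CreationDate (D:20130101120000)\n%%EOF\n"
generated = b"%PDF-1.4\n/CreationDate (D:20130101120507)\n%%EOF\n"

print(copy_creation_date(generated, expected) == expected)
```

Because 14 digits replace 14 digits, the length of the data does not change, so byte offsets elsewhere in the file stay valid.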

We solved the problem by deleting the creation date with a simple regular expression.

    $file_contents = preg_replace("/\\/CreationDate \\(D:.*\\)\\n/uim", "", $file_contents, 1);

After that, we get a consistent checksum every time.


Source: https://habr.com/ru/post/1501508/

