Remove duplicate embedded fonts from a PDF created with pdftk

Is there a way to remove fonts that are embedded multiple times in a PDF file?

This is my scenario:

1) A program generates several one-page reports in PDF format (it queries a database, places the data into an Excel template, and exports the formatted result as PDF).

2) pdftk combines the single-page PDF files into one file.

Everything works fine, but the resulting PDF is very large. I noticed that the fonts are embedded several times (as many times as there are pages: every page is generated from the same Excel template, the fonts are embedded in each single-page PDF, and pdftk just glues the PDFs together). Is there a way to keep only one copy of each embedded font?

I tried embedding the fonts only on the first page when exporting from Excel to PDF: the file size drops sharply, but it seems that the other pages then have no access to the embedded fonts.

Thanks, Alessandro

+6
2 answers

You can try to "repair" the PDF file produced by pdftk using Ghostscript (but use a recent version, for example 9.05). In many cases, Ghostscript will be able to merge the many font subsets into fewer ones.

The command will look like this:

 gswin32c.exe ^
   -o output.pdf ^
   -sDEVICE=pdfwrite ^
   -dPDFSETTINGS=/prepress ^
   input.pdf

Check with

 pdffonts.exe output.pdf
 pdffonts.exe input.pdf

how many instances of the various font subsets each file contains (pdffonts.exe is available as part of the Xpdf command-line tools package).

But do not complain about the "slow speed" of this process. Ghostscript fully interprets all input PDF files to accomplish its task, whereas concatenating files with pdftk is a much simpler process...
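To make the pdffonts comparison above less manual, here is a minimal Python sketch that extracts the font names from pdffonts' text output so the entries can be counted per file. The function names `parse_pdffonts` and `fonts_in` are made up for this illustration, and the parser assumes the usual layout (a header line, a dashed separator line, then one font per line with the name in the first column and no spaces inside the name).

```python
import subprocess


def parse_pdffonts(output: str) -> list[str]:
    """Return the font names listed in pdffonts' text output.

    Assumes the first two lines are the column header and the dashed
    separator, and that font names contain no spaces.
    """
    names = []
    for line in output.splitlines()[2:]:
        fields = line.split()
        if fields:
            names.append(fields[0])
    return names


def fonts_in(pdf_path: str, pdffonts_exe: str = "pdffonts") -> list[str]:
    """Run pdffonts on a file and return the listed font names."""
    result = subprocess.run(
        [pdffonts_exe, pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_pdffonts(result.stdout)
```

Comparing `len(fonts_in("input.pdf"))` with `len(fonts_in("output.pdf"))` then gives a rough idea of how many font entries Ghostscript managed to merge.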


Update:

Instead of pdftk you can use Ghostscript to merge your input PDF files. This may avoid the problems you saw when using Ghostscript after the fact to "repair" the files merged by pdftk. Note that this will be much slower than the "dumb" pdftk merge. However, you may like the results, especially with regard to font handling and file size.

This would be a possible command:

 gswin32c.exe ^
   -o output.pdf ^
   -sDEVICE=pdfwrite ^
   -dPDFSETTINGS=/prepress ^
   input1.pdf ^
   input2.pdf ^
   input3.pdf

You can add additional options to the Ghostscript CLI for finer control over the merge and optimization process.
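If the list of one-page reports is produced by a program anyway, the Ghostscript merge command can be assembled from that list. A minimal Python sketch, assuming `gswin32c.exe` is on the PATH; the helper names `build_gs_merge_command` and `merge_with_ghostscript` are hypothetical and only wrap the options already shown in this answer:

```python
import subprocess


def build_gs_merge_command(inputs, output, gs_exe="gswin32c.exe"):
    """Build the Ghostscript argument list that merges `inputs` into `output`."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return [
        gs_exe,
        "-o", output,               # output file
        "-sDEVICE=pdfwrite",        # write PDF
        "-dPDFSETTINGS=/prepress",  # same preset as in the command above
        *inputs,                    # input PDFs, in page order
    ]


def merge_with_ghostscript(inputs, output, gs_exe="gswin32c.exe"):
    """Run the merge; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(build_gs_merge_command(inputs, output, gs_exe), check=True)
```

Passing the arguments as a list avoids the `^` line continuations and any quoting problems with file names that contain spaces.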

In the end, you have to decide between extremes:

  • "Fast" pdftk creating large output files, vs.
  • "Slow" gswin32c.exe (Ghostscript) creating lean output files.

I would be interested to see some results (runtime and resulting file sizes) for both methods over several merge runs...
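One way to collect such numbers is a small timing helper. This is only a sketch: `measure` is a hypothetical name, and it simply runs whatever command it is given and reports the wall-clock time and the size of the file that command is expected to write.

```python
import os
import subprocess
import time


def measure(command, output_path):
    """Run `command` and return (elapsed_seconds, output_size_in_bytes)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(output_path)


# Example commands to compare (file names are placeholders):
# measure(["pdftk", "page1.pdf", "page2.pdf", "cat", "output", "merged_pdftk.pdf"],
#         "merged_pdftk.pdf")
# measure(["gswin32c.exe", "-o", "merged_gs.pdf", "-sDEVICE=pdfwrite",
#          "-dPDFSETTINGS=/prepress", "page1.pdf", "page2.pdf"],
#         "merged_gs.pdf")
```

Running each command several times and averaging would give more reliable timings than a single run.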


Update 2: Sorry, my previous version contained a typo.
The option is not -sPDFSETTINGS=... but -dPDFSETTINGS=... (d instead of s).


Update 3:

Since your source files are Excel worksheets created from templates (which usually don't use many different fonts), you can try a trick to make sure that Ghostscript has all the necessary glyphs for the fonts used in all of the to-be-merged-later PDFs:

  • For each font and face (regular, italic, bold, bold-italic), add a table cell to the template sheet in the upper left corner of the print area.
  • Fill this table cell with all printable letters, digits and punctuation characters from the ASCII alphabet: 0123456789 , ABCD...XYZ , abc...xyz :-_;°%&$§")({}[] , etc.
  • Make the cell (and the font) as small as you want or need, so that it does not interfere with your layout. Use white as the text color for the characters in the cell (so that they are invisible in the final PDF file).

This method will hopefully ensure that each of your PDF files uses the same glyph subset, which in turn avoids the problems you observed when merging the files with Ghostscript. (Note that if you use, e.g., Arial and Arial-Italic, you need to create 2 such cells: one formatted with regular Arial, the other with italic.)
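The filler text for such a cell does not have to be typed by hand. A minimal Python sketch that builds it from the standard `string` constants; the function name `glyph_filler` and the default extra characters `°§` (taken from the example list above) are choices made for this illustration:

```python
import string


def glyph_filler(extra: str = "°§") -> str:
    """Return digits, ASCII letters and ASCII punctuation, plus `extra`."""
    return (
        string.digits
        + string.ascii_uppercase
        + string.ascii_lowercase
        + string.punctuation
        + extra
    )


print(glyph_filler())
```

The resulting string can be pasted into the template cell once per font face. Any non-ASCII characters that the reports actually use would have to be added through `extra`.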

+4

Fonts in PDF files are usually subsetted, so that they contain only the glyphs that are actually needed. In addition, the encoding is changed so that character code 1 is assigned to the first glyph, code 2 to the second, and so on.

As a result, the first PDF file may contain a font in which 0x01 = A, 0x02 = space, 0x03 = t, 0x04 = e and 0x05 = s. The second file may contain a font in which 0x01 = T, 0x02 = e, 0x03 = s and 0x04 = t.
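The effect can be reproduced with a toy model in Python. This is only an illustration of the idea, assuming codes are handed out in order of first appearance as in the example above; real PDF producers may assign codes differently. The names `subset_encoding` and `encode` are invented for this sketch.

```python
def subset_encoding(text: str) -> dict[int, str]:
    """Assign codes 1, 2, 3, ... to characters in order of first appearance."""
    codes: dict[int, str] = {}
    for ch in text:
        if ch not in codes.values():
            codes[len(codes) + 1] = ch
    return codes


def encode(text: str, codes: dict[int, str]) -> list[int]:
    """Encode text with a subset's code table (KeyError if a glyph is missing)."""
    lookup = {ch: code for code, ch in codes.items()}
    return [lookup[ch] for ch in text]


first = subset_encoding("A test")   # page 1 contains the text "A test"
second = subset_encoding("Test")    # page 2 contains the text "Test"

print(first)    # {1: 'A', 2: ' ', 3: 't', 4: 'e', 5: 's'}
print(second)   # {1: 'T', 2: 'e', 3: 's', 4: 't'}
print(encode("test", first), encode("test", second))   # [3, 4, 5, 3] [4, 2, 3, 4]
```

The same word is stored as different code sequences in the two files, and the second subset has no glyph for "A" at all, which is why one subset cannot simply stand in for the other.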

To avoid confusion, a prefix is added to the font name inside the document. Acrobat drops this prefix when it displays the list of embedded fonts, so it looks as if you have multiple instances of the same font. In reality they are different fonts and cannot easily be combined.
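In the PDF itself the prefix is a tag of six uppercase letters followed by a plus sign (for example `ABCDEF+Arial`). A small Python sketch, with the hypothetical helpers `base_font_name` and `group_subsets`, shows how a list of font names can be grouped by the underlying font to see how many subsets of each are present:

```python
import re
from collections import defaultdict

# Subset tag: six uppercase letters and a plus sign in front of the font name.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def base_font_name(name: str) -> str:
    """Strip the subset tag, if any, from a font name."""
    return SUBSET_PREFIX.sub("", name)


def group_subsets(names: list[str]) -> dict[str, list[str]]:
    """Group font names by their base name."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[base_font_name(name)].append(name)
    return dict(groups)


names = ["ABCDEF+Arial", "GHIJKL+Arial", "MNOPQR+Arial-ItalicMT", "Helvetica"]
for base, members in group_subsets(names).items():
    print(base, len(members))
```

A base name with more than one member is a font that was embedded as several separate subsets. The font names in the example list are made up.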

Assuming this is the case (and I would need to see your files to be sure), it can probably be avoided. If you configure the software that produces the PDFs so that it does not subset fonts, then pdftk may be able to combine the documents without including the same font several times. I have not tested this explicitly, but it might work. Another option is to change the workflow so that the reports are created as multi-page documents in the first place.

+3

Source: https://habr.com/ru/post/915943/

