Handling (remapping) missing / problematic (CID / CJK) fonts in a PDF using ghostscript?

In short, I am dealing with a problematic PDF, which:

  • Cannot be fully displayed in a document viewer such as evince, due to missing font information;
  • However, ghostscript can fully display the same PDF file.

So - regardless of what ghostscript uses to fill in the gaps (maybe fallback glyphs, or some other method of font access) - I would like to use ghostscript to create ("distill") an output PDF in which practically nothing is changed except for the added font information, so that evince can display the document the same way ghostscript does.

My question is: is this even possible, and if so, what command line would achieve it?

Thanks a lot in advance for any answers,
Hurrah!


Details:

Actually, I am on an older Ubuntu 10.04, and what I am experiencing may be not a bug but an installation problem with evince (a missing poppler-data package), as noted in Bug #386008 "Some fonts do not appear due to 'Unknown font tag ...'" (Bugs : "poppler" package : Ubuntu).

However, this is exactly the case I would like to handle, so I will use the fontspec.pdf attached to that bug report ("PDF that triggers the error") to demonstrate the problem.
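For anyone who wants to check the poppler-data angle first, here is a minimal sketch that tests whether CMap files for a character collection are present. The default path /usr/share/poppler is my assumption about where the poppler-data package puts its files, and the function name is made up for illustration; adjust both for your system.

```shell
# Sketch: succeed if <data_dir>/cMap/<collection> exists and is not empty.
# $1 = character collection (e.g. Adobe-Japan1)
# $2 = poppler data directory (assumed default: /usr/share/poppler)
has_poppler_cmaps() {
    cmap_dir="${2:-/usr/share/poppler}/cMap/$1"
    [ -d "$cmap_dir" ] && [ -n "$(ls -A "$cmap_dir" 2>/dev/null)" ]
}

if has_poppler_cmaps Adobe-Japan1; then
    echo "Adobe-Japan1 CMaps found"
else
    echo "Adobe-Japan1 CMaps missing (poppler-data may not be installed)"
fi
```

If the directory is missing, that would be consistent with the "Missing language pack" message, but it does not prove that this is the only cause.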

evince

First, I open this pdf page in evince, and evince complains:

    $ evince --page-label=3 fontspec.pdf
    Error: Missing language pack for 'Adobe-Japan1' mapping
    Error: Unknown font tag 'F5.1'
    Error (7597): No font in show
    Error: Unknown font tag 'F5.1'
    Error (7630): No font in show
    Error: Unknown font tag 'F5.1'
    Error (7660): No font in show
    Error: Unknown font tag 'F5.1'
    ...

The result is as follows:

evince-pdf-missfont-render.png

... and it is obvious that some of the font glyphs are missing.

Adobe acroread

Just to note how Adobe Acrobat Reader for Linux behaves; the following command line:

    $ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf

... generates no output to the terminal (for more on the /a switch, see the acroread man page) - and the program has absolutely no problems displaying the fonts.

<sub> In addition, although I would like to avoid a roundtrip through postscript, note that acroread itself can be used to convert PDF to postscript:

    $ ./Adobe/Reader9/bin/acroread -v
    9.5.1
    $ ./Adobe/Reader9/bin/acroread -toPostScript \
        -rotateAndCenter -choosePaperByPDFPageSize \
        -start 3 -end 3 \
        -level3 -transQuality 5 \
        -optimizeForSpeed -saveVM \
        fontspec.pdf ./

Again, the above command line generates no output to the terminal; -optimizeForSpeed -saveVM are there because, apparently, they deal with fonts; the final argument ./ is the output directory (the output file is automatically named fontspec.ps).

Now evince can display the previously missing fonts in the output fontspec.ps, but it complains again:

    $ evince fontspec.ps
    GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
    GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
    ...

... and besides, all the text seems to have been flattened to curves in the postscript, so the text in the .ps file can no longer be selected in evince (note that the .ps file cannot be opened in acroread). However, you can convert this .ps back to .pdf again:

    $ pstopdf fontspec.ps
    # note, `pstopdf` has no output filename option;
    # it will automatically choose 'fontspec.pdf',
    # and overwrite previous 'fontspec.pdf' in
    # the same directory

... and now the text in the pstopdf output can be selected in evince, all fonts are there, and evince no longer complains. However, as I already noted, I would like to avoid going through postscript files altogether. </sub>

display (from imagemagick )

We can also observe the same page of the document using imagemagick's display (note that panning an image from the command line with display is apparently not possible yet, so I used -crop below to set up the viewport):

    $ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
       **** Warning: considering '0000000000 00000 n' as a free entry.
    ...
       **** This file had errors that were repaired or ignored.
       **** The file was produced by:
       **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
       **** Please notify the author of the software that produced this
       **** file that it does not conform to Adobe published PDF
       **** specification.

... which generates some ghostscript-ish errors - and results in something like this:

imagemagick-display-pdf.png

... where it is obvious that the missing fonts that evince cannot display are shown correctly by imagemagick's display.

ghostscript

Finally, we can use ghostscript as an x11 viewer to observe the same page of the same document:

    $ gs -sDEVICE=x11 -g740x450 -r150x150 -dFirstPage=3 \
        -c '<</PageOffset [-120 520]>> setpagedevice' \
        -f fontspec.pdf
    GPL Ghostscript 9.02 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
    Processing pages 3 through 74.
    Page 3
    >>showpage, press <return> to continue<<
    ^C

... and the result is this output:

ghostscript-pdf-view.png

In conclusion: ghostscript (and, apparently by extension, imagemagick) can find the missing font (or at least some replacement for it) and render the page with it - even though evince fails on the same document.

So I would just like to export from ghostscript a version of the PDF that includes the missing fonts, with no other processing; so I tried this:

    $ gs -dBATCH -dNOPAUSE -dSAFER \
        -dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
        -dAutoFilterMonoImages=false \
        -dAutoFilterGrayImages=false \
        -dAutoFilterColorImages=false \
        -dDownsampleColorImages=false \
        -dDownsampleGrayImages=false \
        -dDownsampleMonoImages=false \
        -sDEVICE=pdfwrite \
        -dFirstPage=3 -dLastPage=3 \
        -sOutputFile=mypg3out.pdf -f fontspec.pdf
    GPL Ghostscript 9.02 (2011-03-30)
    Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
    This software comes with NO WARRANTY: see the file PUBLIC for details.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
       **** Warning: considering '0000000000 00000 n' as a free entry.
    Processing pages 3 through 3.
    Page 3
       **** This file had errors that were repaired or ignored.
       **** The file was produced by:
       **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
       **** Please notify the author of the software that produced this
       **** file that it does not conform to Adobe published PDF
       **** specification.

... but it does not work - the output file mypg3out.pdf suffers from the same problems in evince as noted earlier.

Note: although I would like to avoid the postscript roundtrip, a good example of a gs command line from pdf to ps with font embedding is here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; but the same command line switches, applied to .pdf to .pdf conversion, have no effect on the problem described above.

2 answers

OK, point 1: you CANNOT use Ghostscript and pdfwrite to create a PDF file without other processing taking place.

The way pdfwrite and Ghostscript work is to fully interpret the incoming data (PostScript, PDF, XPS, PCL, etc.), creating a series of graphic primitives that are transferred to the pdfwrite device. The pdfwrite device then reassembles them into a new PDF file.

Thus, it is impossible to take a PDF file as input and merely manipulate it; the result will always be a newly created file.

Now, to get started, I suggest you upgrade Ghostscript from 9.02 to 9.05. Missing CIDFonts are handled much better in 9.05 (and this will be improved further in 9.06 this year). (The font you are missing, Osaka Mono, is actually a CIDFont, not a regular font.)
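If you want to confirm which side of that version boundary you are on, a small sketch like the following may help. The helper name version_at_least is made up for illustration, and it relies on sort -V from GNU coreutils; the gs call is guarded so the snippet does nothing when gs is not on the PATH.

```shell
# Sketch: succeed if version $1 is at least version $2.
# Uses `sort -V` (version sort, GNU coreutils) to order the two strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query Ghostscript if it is actually installed.
if command -v gs >/dev/null 2>&1; then
    if version_at_least "$(gs --version)" 9.05; then
        echo "Ghostscript is 9.05 or newer"
    else
        echo "Ghostscript is older than 9.05"
    fi
fi
```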

Using the current bleeding-edge Ghostscript code produces, for me, a PDF file that has the missing font embedded. I can't say whether this will work for you, because my copy of evince renders the original file perfectly.

Added later

Studying the original PDF, I see that the fonts are really embedded there (as you would expect, since they are subsets). Thus, in fact, as you say in your own answer above, the problem is not in embedding fonts, but in using CIDFonts.

My answer here will not help you, since pdfwrite will still emit a CIDFont in the output. This is basically a flaw in your version or installation of evince.

The problem with remapping characters is that a regular font is limited to 256 glyphs, whereas a CIDFont has no such limit. Therefore, it is not possible to map a CIDFont onto a single font. The only way to do this is to create several fonts, each of which contains part of the original, and then switch between them as needed. Slow and clunky.
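To make the arithmetic behind that splitting concrete, here is a rough sketch. It is only an illustration of the 256-glyph limit, not something Ghostscript exposes; the function names and the glyph count of 1000 are hypothetical.

```shell
# Number of 256-glyph fonts needed to hold $1 glyphs (ceiling division).
fonts_needed() {
    echo $(( ($1 + 255) / 256 ))
}

# For glyph index $1, print the sub-font it would land in and its
# one-byte code inside that sub-font.
split_cid() {
    echo "$(( $1 / 256 )) $(( $1 % 256 ))"
}

fonts_needed 1000   # hypothetical subset of 1000 glyphs -> prints 4
split_cid 300       # glyph 300 -> prints "1 44" (sub-font 1, code 44)
```

Every run of text would then have to switch fonts whenever consecutive glyphs fall into different sub-fonts, which is where the slowness comes from.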

If you convert to PostScript using the ps2write device, then it will do it for you, but you risk that in this process it will convert vector glyph data to bitmaps that will not scale well.

In fact, you cannot really achieve what you want (converting 1 CIDFont to N regular fonts) with Ghostscript or, for that matter, with any other tool I know of. Although it is technically possible, it makes no real sense, since all PDF consumers should be able to handle CIDFonts. If they cannot, that is a bug in the PDF consumer.


Right, I have worked this out a bit (but not completely), so I will post a partial answer / comment here.

In fact, this is not a problem embedding fonts in PDF - this is a problem with displaying fonts.

To show this, let's analyze mypg3out.pdf, which was extracted by gs in the OP (from the third page of fontspec.pdf):

    $ pdffonts mypg3out.pdf
    name                                 type              emb sub uni object ID
    ------------------------------------ ----------------- --- --- --- ---------
    Error: Missing language pack for 'Adobe-Japan1' mapping
    CAAAAA+Osaka-Mono-Identity-H         CID TrueType      yes yes yes     19  0
    GBWBYF+CMMI9                         Type 1C           yes yes yes     28  0
    FDFZUN+Skia-Regular_wght13333_wdth11999 TrueType          yes yes yes     16  0
    ZRLTKK+Optima-Regular                TrueType          yes yes yes     30  0
    ZFQZLD+FPLNeu-Bold                   Type 1C           yes yes yes      8  0
    DDRFOG+FPLNeu-Italic                 Type 1C           yes yes no      22  0
    HMZJAO+FPLNeu-Regular                Type 1C           yes yes yes     10  0
    RDNKXT+FPLNeu-Regular                Type 1C           yes yes yes     32  0
    GBWBYF+Skia-Regular_wght13333_wdth11999 TrueType          yes yes no      26  0

As the output shows, all fonts are indeed embedded; so the problem is something else. (This would be harder to observe in the full fontspec.pdf, as there are a ton of fonts and a ton of error messages.)

The key point (I think) here is that there is:

  • only one message " Error: Missing language pack for 'Adobe-Japan1' mapping "; and
  • only one CID TrueType font, namely CAAAAA+Osaka-Mono-Identity-H
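To pick out the CID fonts from a longer pdffonts listing, a small filter along these lines might be handy. It is a sketch under a few assumptions: font names contain no spaces, the first two lines of stdout are the header and the dashed separator, and the error messages go to stderr. The function name list_cid_fonts is made up.

```shell
# Sketch: read `pdffonts` output on stdin and print "<name> <emb>" for
# every font whose type starts with "CID". The emb flag is counted from
# the end of the line (emb sub uni obj gen), so multi-word type names
# such as "CID TrueType" do not shift it.
list_cid_fonts() {
    awk 'NR > 2 && $2 == "CID" { print $1, $(NF - 4) }'
}

# Intended use (not run here):
#   pdffonts mypg3out.pdf 2>/dev/null | list_cid_fonts
```

For the listing above I would expect this to print only the Osaka-Mono line with "yes" for the embedded flag.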

There seems to be an obvious connection between the CID TrueType font and the 'Adobe-Japan1' mapping error; and I finally found this clarified in the "CID fonts" section of "How to Use Ghostscript":

CID fonts are PostScript resources that contain a large number of glyphs (for example, glyphs for Far Eastern languages: Chinese, Japanese, and Korean). For more information, see the PostScript Language Reference, third edition.

CID font resources are a different kind of PostScript resource from fonts. In particular, they cannot be used as regular fonts. A CID font resource must first be combined with a CMap resource, which defines specific codes for glyphs, before it can be used as a font. This allows a collection of glyphs to be reused with different encodings.

All well and good - except that here we are dealing with PDF fonts, not PostScript fonts; let me demonstrate that a bit.

For example, "5.3. Using Ghostscript to Preview Fonts - Creating Fonts in Ghostscript - Font HowTo" describes how a script installed with Ghostscript, prfont.ps, can be used to display a font table.

However, it is easier here to follow the "[gs-devel] Listing Ghostscript Fonts" post: use the resourceforall operator to list fonts, and the resourcestatus operator to query for a specific font - neither of which requires a special .ps script:

    $ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
        -c 'currentpagedevice (*) {=} 100 string /Font resourceforall'
    ...
    Processing pages 1 through 1.
    Page 1
    URWAntiquaT-RegularCondensed
    Palatino-Italic
    Hershey-Gothic-Italian
    ...
    $ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
        -c '/TimesNewRoman findfont pop [/TimesNewRoman /Font resourcestatus]'
    ....
    Processing pages 1 through 1.
    Page 1
    Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
    Can't find (or can't open) font file TimesNewRomanPSMT.
    Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
    Can't find (or can't open) font file TimesNewRomanPSMT.
    Querying operating system for font files...
    Loading TimesNewRomanPSMT font from /usr/share/fonts/truetype/msttcorefonts/times.ttf... 2549340 1142090 3496416 1237949 1 done.

We do get a list of fonts; however, these are the system fonts available to ghostscript - not the fonts embedded in the PDF!

( In principle,

  • gs -o /dev/null -dNODISPLAY -f mypg3out.pdf -c 'currentpagedevice (*) {=} 100 string /Font resourceforall' | grep -i osaka will return nothing, while
  • -c '/CAAAAA+Osaka-Mono-Identity-H findfont pop [/CAAAAA+Osaka-Mono-Identity-H /Font resourcestatus]' will conclude: "Didn't find this font on the system! Substituting font Courier for CAAAAA+Osaka-Mono-Identity-H." )

To list the fonts in a PDF, you can use the pdf_info.ps script file from Ghostscript (it is not installed; it is in the sources):

    $ wget "http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/toolbin/pdf_info.ps" -O pdf_info.ps
    $ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsNeeded pdf_info.ps
    ...
    No system fonts are needed.
    $ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsUsed -dShowEmbeddedFonts pdf_info.ps
    ...
    Font or CIDFont resources used:
    CAAAAA+Osaka-Mono
    DDRFOG+FPLNeu-Italic
    FDFZUN+Skia-Regular_wght13333_wdth11999
    GBWBYF+CMMI9
    GBWBYF+Skia-Regular_wght13333_wdth11999
    GTIIKZ+Osaka-Mono
    HMZJAO+FPLNeu-Regular
    RDNKXT+FPLNeu-Regular
    ZFQZLD+FPLNeu-Bold
    ZRLTKK+Optima-Regular

So we can finally observe CAAAAA+Osaka-Mono in Ghostscript - although I don't know how to query ghostscript for more specific information about it.

In the end, I think my question boils down to: how can ghostscript be used to map glyphs from the embedded CID font into a font with a different "encoding" (or "character map"?) that does not require additional language files?

Addendum

<sub> I also experimented with these approaches:

  • with this, pdffonts no longer lists Osaka-Mono, but it still complains "Error: Missing language pack for 'Adobe-Japan1' mapping":
      $ wget http://whalepdfviewer.googlecode.com/svn/trunk/cmaps/japanese/Adobe-Japan1-UCS2
      $ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH -f mypg3out.pdf Adobe-Japan1-UCS2
  • same as before - this (via Ghostscript's "Use.htm") also makes Osaka-Mono disappear from the pdffonts list:
      gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
        -c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \
        -f mypg3out.pdf
  • this one crashes with Error: /undefinedresource in findresource:
      gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
        -c '/Osaka-Mono-Identity-H /H /CMap findresource [/Osaka-Mono-Identity /CIDFont findresource] ==' \
        -f mypg3out.pdf

Please note that some of the gs .ps scripts are installed and used automatically; for example, you can find gs_ttf.ps:

    $ locate gs_ttf.ps
    /usr/share/ghostscript/9.02/Resource/Init/gs_ttf.ps

... and then, using sudo nano `locate gs_ttf.ps`, you can add the statement (Hello from gs_ttf.ps\n) print at the beginning of the code; then, whenever one of the gs commands above is called, the printout will be visible in stdout.
</sub>


Source: https://habr.com/ru/post/918425/
