PDF: Character Code → Glyph Name → NSString

Following up on my previous questions: I am trying to extract text from a PDF file using the CGPDF* functions. Given:

CGPDFStringRef pdfString 

I realized that it can be converted to an array of character codes as follows:

 const unsigned char *characterCodes = CGPDFStringGetBytePtr(pdfString); 

The text I'm trying to extract is set in one of the 14 standard Type 1 fonts, which is not embedded in the PDF file itself. I therefore parsed the corresponding AFM file for this font, which gives me a mapping from character code to glyph name and width, as follows:

 C 61 ; WX 600 ; N equal ; B 80 138 520 376 ;
 C 63 ; WX 600 ; N question ; B 129 -15 492 572 ;
 C 64 ; WX 600 ; N at ; B 77 -15 533 622 ;
 C 65 ; WX 600 ; N A ; B 3 0 597 562 ;
 C 66 ; WX 600 ; N B ; B 43 0 559 562 ;
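For illustration, one such metrics line could be split into its code, width, and glyph name roughly as follows. This is only a sketch in plain C: the `AFMCharMetric` struct and the `afm_parse_char_metric` function are hypothetical names, and it assumes every line lists the `C`, `WX`, and `N` keys in exactly that order, as in the excerpt above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed AFM character-metrics entry (hypothetical struct for this sketch). */
typedef struct {
    int code;       /* C: character code in the font's built-in encoding */
    int width;      /* WX: horizontal width of the glyph */
    char name[64];  /* N: glyph name, e.g. "equal" */
} AFMCharMetric;

/* Parses a line such as "C 61 ; WX 600 ; N equal ; B 80 138 520 376 ;".
   Assumes the keys appear in the order C, WX, N; the bounding box is ignored.
   Returns 1 on success and 0 if the line does not have that shape. */
int afm_parse_char_metric(const char *line, AFMCharMetric *out)
{
    assert(line != NULL && out != NULL);
    int fields = sscanf(line, " C %d ; WX %d ; N %63s",
                        &out->code, &out->width, out->name);
    return fields == 3;
}
```

Feeding every line between the AFM file's StartCharMetrics and EndCharMetrics markers through such a function would yield the code-to-name table for the font's built-in encoding.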

My question is: knowing the character code, say 61, how do I get from its glyph name, "equal", to the NSString @"="? This matters especially when the character code is reassigned to another glyph name, for example "question", by the font's encoding entry in the PDF.

Previous questions: iOS PDF parsing Type 1 Metric fonts and iOS PDF for simple text analyzer

1 answer

I have not tested this, but it seems to me that you need the Adobe Glyph Naming convention for this:

The purpose of the Adobe Glyph Naming convention is to support the computation of a Unicode character string from a sequence of glyphs. This is achieved by specifying a mapping from glyph names to character strings.

The glyphlist.txt file linked on that page seems relevant to your problem.
Sample excerpt:

...
epsilon;03B5
epsilontonos;03AD
equal;003D
equalmonospace;FF1D
equalsmall;FE66
equalsuperior;207C
...

Then all you have to do is put these Unicode values into your NSString instance.
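As a rough, untested-on-device sketch of that lookup in plain C: the two functions below (hypothetical names `agl_lookup` and `utf8_encode`) search glyphlist-style text for a glyph name and encode the resulting code point as UTF-8. It assumes the `name;HEX` line format of glyphlist.txt and, for entries that map to several code points, returns only the first one.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Looks up `glyph` in glyphlist-style text (one "name;HEX" entry per line).
   Returns the first code point of the matching entry, or -1 if not found. */
long agl_lookup(const char *glyphlist, const char *glyph)
{
    assert(glyphlist != NULL && glyph != NULL);
    size_t glyph_len = strlen(glyph);
    const char *line = glyphlist;
    while (*line != '\0') {
        size_t line_len = strcspn(line, "\n");
        /* The line must start with exactly "<glyph>;" followed by a hex digit. */
        if (line_len > glyph_len + 1 &&
            strncmp(line, glyph, glyph_len) == 0 &&
            line[glyph_len] == ';' &&
            isxdigit((unsigned char)line[glyph_len + 1])) {
            return (long)strtoul(line + glyph_len + 1, NULL, 16);
        }
        line += line_len;
        if (*line == '\n')
            line++;  /* step over the newline to the next entry */
    }
    return -1;
}

/* Encodes one Unicode code point as UTF-8 into buf (at least 5 bytes),
   NUL-terminates it, and returns the number of bytes before the NUL. */
size_t utf8_encode(unsigned long cp, char *buf)
{
    size_t n = 0;
    if (cp < 0x80) {
        buf[n++] = (char)cp;
    } else if (cp < 0x800) {
        buf[n++] = (char)(0xC0 | (cp >> 6));
        buf[n++] = (char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        buf[n++] = (char)(0xE0 | (cp >> 12));
        buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = (char)(0x80 | (cp & 0x3F));
    } else {
        buf[n++] = (char)(0xF0 | (cp >> 18));
        buf[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[n++] = (char)(0x80 | (cp & 0x3F));
    }
    buf[n] = '\0';
    return n;
}
```

In Objective-C, the resulting NUL-terminated buffer could then be handed to `+[NSString stringWithUTF8String:]`; a linear scan is used here only for brevity, and a real implementation would more likely load the list into a dictionary once.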

Edit:
Confirming the above, I found the following explanation in the Adobe PDF Reference, Section 5.9 - Extracting Text Content:

If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Appendix D):

  • Map the character code to a character name according to Table D.1 on page 996 and the font's Differences array.
  • Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
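The two steps quoted above might be sketched in plain C as follows. Everything here is illustrative: the `EncodingDifference` struct, the function names, and the five-entry tables are assumptions standing in for a full 256-entry base encoding, the font's actual Differences array, and the complete Adobe Glyph List.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One override taken from a font's Differences array (hypothetical struct). */
typedef struct {
    unsigned char code;
    const char *glyph;
} EncodingDifference;

/* Tiny excerpt of a base encoding; a real table would cover all 256 codes. */
const char *base_glyph_name(unsigned char code)
{
    switch (code) {
    case 61: return "equal";
    case 63: return "question";
    case 64: return "at";
    case 65: return "A";
    case 66: return "B";
    default: return NULL;
    }
}

/* Step 1: character code -> glyph name. Differences entries win over the base. */
const char *glyph_name_for_code(unsigned char code,
                                const EncodingDifference *diffs, size_t ndiffs)
{
    assert(diffs != NULL || ndiffs == 0);
    for (size_t i = 0; i < ndiffs; i++)
        if (diffs[i].code == code)
            return diffs[i].glyph;
    return base_glyph_name(code);
}

/* Step 2: glyph name -> Unicode code point, using a stand-in for the glyph list. */
long unicode_for_glyph(const char *glyph)
{
    static const struct { const char *name; long cp; } table[] = {
        {"equal", 0x003D}, {"question", 0x003F}, {"at", 0x0040},
        {"A", 0x0041}, {"B", 0x0042},
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].name, glyph) == 0)
            return table[i].cp;
    return -1;
}

/* Both steps combined; returns -1 when the code or the name is unknown. */
long unicode_for_code(unsigned char code,
                      const EncodingDifference *diffs, size_t ndiffs)
{
    const char *name = glyph_name_for_code(code, diffs, ndiffs);
    return name != NULL ? unicode_for_glyph(name) : -1;
}
```

With no overrides, code 61 resolves through "equal" to U+003D; if the Differences array reassigns 61 to "question", the same call yields U+003F instead, which is the remapping case raised in the question.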

Source: https://habr.com/ru/post/1438592/

