Parsing a PDF on iOS for simple text extraction

I am completely lost on this topic. I have read almost all the posts about it here, so I would really appreciate it if someone pointed me in the right direction.

I have a PDF and would like to extract its text; I am only interested in words and spaces. I have set up a CGPDFScanner with callback methods. I read that I only need to handle four operators to get the text: TJ, Tj, single quote (') and double quote (").

I think I also need to track the text space so that I can decide whether letters should be joined into a word or separated by a space, but I have no idea how to do that.

In my PDF, all text looks like this:

[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ 

but I was not able to figure out (using the PDF specification) what these numbers mean. Someone on SO said you shouldn't be afraid of the PDF specification, but frankly, I don't find it very easy to read or understand.

I studied the PDFKitten code, which was useful.

Any help would be greatly appreciated.

1 answer

I can't give you advice on how to extract words from a PDF, but the format

 [(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ

is explained in the PDF 1.7 specification, section 9.4.3, "Text-Showing Operators". The description of the TJ operator reads:

Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space.

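Thus, the numbers adjust the spacing between glyphs. The specification subtracts each number from the current position, so in horizontal writing mode a negative value such as -24.2524 moves the next glyph slightly to the right: by 24.2524 thousandths of a text-space unit, scaled by the current font size.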
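
To make the arithmetic concrete, here is a minimal sketch in Python (chosen only for brevity; the same logic would sit in the TJ callback of a CGPDFScanner). It assumes the TJ array has already been unpacked into a list of strings and numbers. The names `tj_displacement` and `tj_to_text` and the threshold of 200 are illustrative assumptions, not part of any library. The threshold is a heuristic: a space glyph is often roughly a quarter of an em wide, but this varies by font, so the value would need tuning for a real document.

```python
from typing import List, Union

# One TJ array element: a string to show, or a numeric position adjustment.
TJElement = Union[str, float]


def tj_displacement(adjustment: float, font_size: float,
                    horizontal_scale: float = 1.0) -> float:
    """Horizontal shift, in text-space units, caused by one TJ number.

    The number is given in thousandths and is subtracted from the current
    position, so a negative adjustment moves the next glyph to the right.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scale


def tj_to_text(elements: List[TJElement],
               space_threshold: float = 200.0) -> str:
    """Join the strings of a TJ array, guessing where word gaps are.

    A space is inserted when a number pushes the next glyph to the right
    by at least `space_threshold` thousandths (hypothetical heuristic).
    """
    parts: List[str] = []
    for element in elements:
        if isinstance(element, str):
            parts.append(element)
        elif -element >= space_threshold and parts and not parts[-1].endswith(" "):
            # Large rightward gap: treat it as a word break.
            parts.append(" ")
        # Smaller adjustments are kerning and are ignored.
    return "".join(parts)


# The array from the question: every gap is only about 24 thousandths,
# so under this heuristic the glyphs belong to one word.
question_array = ["X", -24.2524, "X", -24.2524, "X", -24.2524,
                  "Y", -24.2524, "Y", -24.2524]
print(tj_to_text(question_array))
print(tj_displacement(-24.2524, font_size=12.0))
```

With a 12-point font, each -24.2524 shifts the next glyph by about 0.29 units, which is far smaller than a typical space, so the heuristic treats it as kerning rather than a word break. Note that this only covers gaps inside a single TJ array; gaps created by other positioning operators between text-showing calls would need separate tracking.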


Source: https://habr.com/ru/post/1438602/

