iOS PDF parsing for simple text

Question

I am completely lost on this topic. I have read almost all of the posts about it here, so I would really appreciate it if someone could point me in the right direction.

I have a PDF and would like to extract its text; I am only interested in the words and the spaces between them. I have set up a CGPDFScanner with callback methods. I read that I only need to handle four operators, TJ, Tj, quote (') and double quote ("), to get at the text.

I think I also need to track the text space in order to determine whether letters should be joined into a word or separated by a space. But I have no idea how to do that.

In my PDF, all of the text appears in this form:

[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ

but I was not able to figure out (using the PDF specification) what these numbers mean. Someone on SO said you shouldn't be afraid of the PDF specification, but frankly, I don't find it very easy to read or understand.

I have studied the PDFKitten code, which was useful.

Any help would be greatly appreciated.

ios text extract pdf cgpdf

Dij Sep 17 '12 at 18:22

1 answer

Martin r · Answer 1 · 2012-09-17T18:39:45+0000

I cannot give you advice on how to extract words from a PDF, but the format

 [(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ

is explained, for example, in the PDF Specification 1.7, section "9.4.3 Text-Showing Operators". The description of the TJ operator:

Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space.

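Thus, the numbers adjust the distance between the letters. Each number is subtracted from the current position, so in horizontal writing a negative value such as -24.2524 moves the next glyph slightly to the right.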
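
A minimal sketch of how those numbers could be used to decide between joining letters and inserting a space. It is plain Python rather than CGPDFScanner code, so the logic can be tried on its own. The function names and the -200 threshold are hypothetical and would need tuning per document. It also assumes the strings are already readable text (a real PDF may need the font's encoding applied first) and ignores escaped or nested parentheses.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Simplification: escaped or nested parentheses inside strings are not handled.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?|-?\.\d+)")


def parse_tj_array(source):
    """Split the text between '[' and ']' into strings and float adjustments."""
    body = source[source.index("[") + 1 : source.rindex("]")]
    elements = []
    for match in TOKEN.finditer(body):
        if match.group(1) is not None:
            elements.append(match.group(1))         # a shown string
        else:
            elements.append(float(match.group(2)))  # a position adjustment
    return elements


def adjustment_in_text_units(adjustment, font_size, horizontal_scaling=1.0):
    """Horizontal shift caused by one TJ number, in text space units.

    The number is in thousandths and is subtracted from the position,
    hence the sign flip. Character and word spacing are left out here.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scaling


def tj_to_text(elements, space_threshold=-200.0):
    """Join the strings, inserting a space for large rightward shifts.

    space_threshold is an assumed heuristic value, not something the
    PDF specification defines.
    """
    parts = []
    for element in elements:
        if isinstance(element, str):
            parts.append(element)
        elif element <= space_threshold:
            parts.append(" ")
    return "".join(parts)


sample = "[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ"
print(tj_to_text(parse_tj_array(sample)))                    # XXXYY
print(tj_to_text(parse_tj_array("[(Hello)-300(World)]TJ")))  # Hello World
```

With the array from the question, every adjustment is far smaller in magnitude than the assumed threshold, so the letters are joined into one run. Word gaps in real files can also come from explicit space characters or from repositioning operators between TJ calls, which this sketch does not cover.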
