I am completely lost on this topic. I have read almost all the posts about it here, so I would really appreciate it if someone could push me in the right direction.
I have a PDF and I would like to extract its text; I am only interested in words and spaces. I have set up a CGPDFScanner with operator callbacks. I read that I only need to handle four operators to get the text: TJ, Tj, single quote (') and double quote (").
I think I also need to track positions in text space in order to decide whether consecutive letters should be joined into one word or separated by a space. But I have no idea how to do that.
In my PDF, all the text appears in arrays like this:
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
but I was not able to figure out from the PDF specification what these numbers mean. Someone on SO said you shouldn't be afraid of the PDF specification, but frankly, I don't find it very easy to read or understand.
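My tentative reading, which I would like confirmed, is that each number in a TJ array is a horizontal adjustment in thousandths of a text-space unit that is subtracted from the current position, so a negative value pushes the next glyph to the right. Under that assumption, here is a rough sketch of how I imagine deciding where spaces go. It is in Python only to keep it short (the real code would live in my CGPDFScanner callbacks); the function name `text_from_tj` and the threshold of -200 are my own guesses and do not come from the specification or from any library.

```python
import re

# Hypothetical cut-off: an adjustment more negative than this (in thousandths
# of a text-space unit) is treated as a word gap. A guess, not a spec value.
WORD_GAP_THRESHOLD = -200.0

# Matches a literal string "(...)" with backslash escapes, or a number.
# Simplified on purpose: no hex strings, no nested parentheses, no escape decoding.
_TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[-+]?\d*\.?\d+")


def text_from_tj(array_source, threshold=WORD_GAP_THRESHOLD):
    """Join the strings of one TJ array, inserting a space at large gaps."""
    out = []
    for token in _TOKEN.findall(array_source):
        if token.startswith("("):
            out.append(token[1:-1])   # string operand: the glyphs to show
        elif float(token) < threshold:
            out.append(" ")           # large rightward shift: assume a word gap
        # smaller adjustments are treated as kerning and ignored
    return "".join(out)


print(text_from_tj("[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ"))
# prints: XXXYY
print(text_from_tj("[(Hello)-333(World)]TJ"))
# prints: Hello World
```

If this reading is right, the -24.2524 values in my file are just slight letter spacing and not word breaks, which would mean the spaces between words have to come from somewhere else, such as the position changes between separate TJ calls.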
I studied the PDFKitten code, which was useful.
Any help would be greatly appreciated.