Is there a PDF parser written in Objective-C or C?

I am writing an iPhone application for reading PDF documents.

I know how to display a PDF file using the CGPDF* types in iOS.

Now I want to search for text in a PDF file and highlight the matches. So I need a library that can tell me where on the page each piece of text is located. In addition, the library must handle Unicode text, including Chinese characters.

I searched for a few days, but could not find anything suitable.

I tried xpdf, but it is written in C++. I do not know how to use C++ code in an iPhone application.

I also tried http://www.codeproject.com/KB/cpp/ExtractPDFText.aspx but it does not handle Chinese characters.

I tried writing the code myself, but text encoding in PDF is really complicated.

For example, I do not know which mapping to use when I want to decode text set in the following font:

8 0 obj
<< /Type /Font /Subtype /Type0 /Encoding /Identity-H
   /BaseFont /RNXJTV+PMingLiU /DescendantFonts [ 157 0 R ] >>
endobj
157 0 obj
<< /Type /Font /Subtype /CIDFontType2 /BaseFont /RNXJTV+PMingLiU
   /CIDSystemInfo << /Registry (Adobe) /Ordering (CNS1) /Supplement 0 >>
   /FontDescriptor 158 0 R /W 161 0 R /DW 1000 /CIDToGIDMap 162 0 R >>
endobj
158 0 obj
<< /Type /FontDescriptor /Ascent 801 /CapHeight 711 /Descent -199 /Flags 32
   /FontBBox [0 -199 999 801] /FontName /RNXJTV+PMingLiU /ItalicAngle 0
   /StemV 0 /Leading 199 /MaxWidth 1000 /XHeight 533 /FontFile2 159 0 R >>
endobj
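For a font like this one, decoding is a two-step process. With /Encoding /Identity-H, every character code in a string is two bytes, big-endian, and the code value is used directly as the CID. The CID is not a Unicode value; it still has to be mapped, either through the font's /ToUnicode CMap (the dictionary shown here does not list one) or through a CID-to-Unicode table for the Adobe-CNS1 character collection named in /CIDSystemInfo, such as Adobe's Adobe-CNS1-UCS2 CMap resource. The C sketch below shows both steps. The function names and the cid_map_entry struct are hypothetical, and the table is assumed to be supplied by the caller rather than shipped with the code.

```c
#include <stddef.h>
#include <stdint.h>

/* Step 1: with /Identity-H, each pair of bytes (big-endian) is one CID.
   Returns the number of CIDs written; a trailing odd byte is ignored. */
static size_t identity_h_to_cids(const unsigned char *bytes, size_t len,
                                 uint16_t *cids, size_t max_cids)
{
    size_t count = 0;
    for (size_t i = 0; i + 1 < len && count < max_cids; i += 2) {
        cids[count++] = (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
    }
    return count;
}

/* Hypothetical table entry: one CID and the Unicode code point it maps to.
   A real table would be built from a ToUnicode CMap or from a
   CID-to-Unicode resource for the font's character collection. */
struct cid_map_entry {
    uint16_t cid;
    uint32_t unicode;
};

/* Step 2: look a CID up in a caller-supplied table.
   Returns U+FFFD (replacement character) when the CID is not listed. */
static uint32_t cid_to_unicode(uint16_t cid,
                               const struct cid_map_entry *map, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].cid == cid) {
            return map[i].unicode;
        }
    }
    return 0xFFFD;
}
```

A linear search keeps the sketch short; a real table for a CJK collection has thousands of entries and would more likely be sorted and searched with bsearch.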
3 answers

Take a look at the CGPDFScanner type; it can be used to scan a PDF content stream for strings and for specific PDF operators.
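To make concrete what scanning for strings and operators involves, here is a plain C sketch that is independent of CGPDFScanner and only imitates its callback style. It walks an already-decompressed content stream and reports the literal string operand of each Tj (show text) operator. The names scan_tj_strings, tj_callback, last_string, and remember_last are hypothetical. The sketch is deliberately limited: it handles only the \(, \), and \\ escapes, and it ignores TJ arrays, hex strings, octal escapes, and comments, all of which appear in real files.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: receives the raw bytes of one Tj operand. */
typedef void (*tj_callback)(const char *bytes, size_t len, void *info);

/* Scans an uncompressed content stream and calls cb for each "(...) Tj".
   Returns the number of Tj operands reported. Strings longer than the
   internal buffer are truncated. */
static int scan_tj_strings(const char *s, size_t n, tj_callback cb, void *info)
{
    char buf[256];
    size_t i = 0;
    int count = 0;

    while (i < n) {
        if (s[i] != '(') { i++; continue; }

        /* Copy the literal string, honoring backslash escapes and
           balanced nested parentheses. */
        size_t len = 0;
        int depth = 1;
        i++;
        while (i < n && depth > 0) {
            char c = s[i++];
            if (c == '\\' && i < n) {
                c = s[i++];              /* keep the escaped byte as-is */
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (--depth == 0) break; /* end of the string operand */
            }
            if (len < sizeof buf) buf[len++] = c;
        }
        if (depth != 0) break;           /* unterminated string: stop */

        /* Skip white space, then check whether the operator is Tj. */
        while (i < n && (s[i] == ' ' || s[i] == '\n' ||
                         s[i] == '\r' || s[i] == '\t')) i++;
        if (i + 1 < n && s[i] == 'T' && s[i + 1] == 'j') {
            cb(buf, len, info);
            count++;
            i += 2;
        }
    }
    return count;
}

/* Example callback: remembers the most recent string that was shown. */
struct last_string { char text[256]; size_t len; };

static void remember_last(const char *bytes, size_t len, void *info)
{
    struct last_string *ls = info;
    if (len > sizeof ls->text) len = sizeof ls->text;
    memcpy(ls->text, bytes, len);
    ls->len = len;
}
```

The bytes passed to the callback are still in the font's encoding, so for a composite font they are character codes rather than readable text.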


This project is well-presented Objective-C code. It has a few bugs, but they can be fixed easily.

https://github.com/KurtCode/PDFKitten


CGPDFScanner can only scan the content of a PDF; it does not tell you where a word is located on the page, so highlighting is not possible with the CGPDF functions alone. Also, for FlateDecode and other filtered streams, the scanner output is still encoded text. It can scan only simple PDF files and linearized PDF files. (Open the PDF as a text file; a linearized file has the word Linearized near the top.)

A possible solution is to use a C or C++ parsing library, if you can find one. Note that the C++ project from CodeProject only extracts the content and does not provide any location information either. Writing your own PDF parser is difficult because the PDF format is complex and allows many variations; for example, content can be encoded with different filters, such as FlateDecode.


Source: https://habr.com/ru/post/1334063/

