Reading a PDF via PDFBox in Java

I ran into a problem while reading a PDF with PDFBox. Part of my PDF is unreadable: when I copy the affected part and paste it into an editor, it displays small text characters, but when I read the same file through PDFBox, those characters are not extracted at all. I do not expect them to be read correctly, but I would expect to get at least some characters, even random ones, in place of the actual text. Is there any way to achieve this? The line is selectable, so it is not an image. Has anyone found a workaround?

There is a PDFBox example in which the writeString method of the PDFTextStripper class is overridden to obtain additional font properties. I use this approach to get my text along with some font properties. So my question is why PDFBox does not read every character (it could print gibberish for the ones it cannot map). In my case, I counted the number of times the method was called (each call corresponds to one character) and found that the call count matched the number of characters in the output text, but not the number of characters in the PDF. Here is a sample PDF: the word "profit" is not readable, and the output does not even contain gibberish for this word; it is simply skipped. Here is the link: https://drive.google.com/file/d/0B_Ke2amBgdpedUNwVTR3RVlRTFE/view?usp=sharing

1 answer

The first file, "PnL_500010_0314.pdf"

Indeed, virtually the entire line "Profit and Loss Statement for the year ended March 31, 2014" and much more cannot be extracted. Inspecting the content makes the cause obvious: this text is written using a composite font that has neither an Encoding nor a ToUnicode entry that would allow the characters in question to be identified.

The method showGlyph of org.apache.pdfbox.text.PDFTextStreamEngine (from which PDFTextStripper is derived) contains the following code shortly before its call to processTextPosition (which PDFTextStripper implements and from which it obtains its text information):

    // use our additional glyph list for Unicode mapping
    unicode = font.toUnicode(code, glyphList);

    // when there is no Unicode mapping available, Acrobat simply coerces the character code
    // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
    // this, which is why we leave it until this point in PDFTextStreamEngine.
    if (unicode == null)
    {
        if (font instanceof PDSimpleFont)
        {
            char c = (char) code;
            unicode = new String(new char[] { c });
        }
        else
        {
            // Acrobat doesn't seem to coerce composite font character codes, instead it
            // skips them. See the "allah2.pdf" TestTextStripper file.
            return;
        }
    }

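This is what happens here: no Unicode mapping is available, so unicode is null.

Since the font in question is composite, not simple, the else branch is executed and processTextPosition is never even called.

PDFTextStripper is therefore never informed that the text "Profit and Loss Statement for the year ended March 31, 2014" exists! The branch responsible is this one: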

    else
    {
        // Acrobat doesn't seem to coerce composite font character codes, instead it
        // skips them. See the "allah2.pdf" TestTextStripper file.
        return;
    }

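If you replace that else branch in PDFTextStreamEngine.showGlyph with code that sets unicode, for example to the Unicode replacement character as in the next snippet, you get the extraction output shown after it: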

    else
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

57
THIRTY SEVENTH ANNUAL REPORT 2013-14
STANDALONE FINANCIAL STATEMENTS
                                                             
As per our report attached. Directors
For Deloitte Haskins & Sells LLP Deepak S. Parekh Nasser Munjee R. S. Tarneja
Chartered Accountants          B. S. Mehta J. J. Irani
D. N. Ghosh Bimal Jalan
Keki M. Mistry S. A. Dave D. M. Sukthankar
Sanjiv V. Pilgaonkar                
Partner                        
Renu Sud Karnad V. Srinivasa Rangan Girish V. Koliyote
      , May 6, 2014 Managing Director                                     
Notes Previous Year
  in Crore   in Crore
INCOME
                        23  23,894.03  20,796.95 
                             24  248.98  315.55 
             25  54.66  35.12 
Total Revenue  24,197.67  21,147.62 
EXPENSES
Finance Cost 26  16,029.37  13,890.89 
               27  279.18  246.19 
                       28  86.98  75.68 
               29  230.03  193.43 
                               11 & 12  31.87  23.59 
Provision for Contingencies  100.00  145.00 
Total Expenses  16,757.43  14,574.78 

PROFIT BEFORE TAX  7,440.24  6,572.84 
           
               1,973.00  1,727.68 
               14  27.00  (3.18)
PROFIT FOR THE YEAR 3  5,440.24  4,848.34 
EARNINGS PER SHARE                2) 31
- Basic 34.89 31.84
- Diluted 34.62 31.45
                                                             

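Unfortunately, PDFTextStreamEngine.showGlyph uses some private class members. Thus you cannot simply override it in your own PDFTextStripper extension, reusing the original method code with the change described above. You have to either replicate nearly all of the functionality of PDFTextStreamEngine in your own class, resort to Java reflection, or patch the PDFBox classes themselves.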
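One way around private members of a superclass is Java reflection. The following is only a minimal sketch of the mechanism on a hypothetical pair of classes: `Engine`, `PatchedEngine`, and the field `glyphTable` are invented for illustration and are not PDFBox names. Which PDFBox members would actually have to be accessed depends on the PDFBox version in use.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for a library class that keeps its state private. */
class Engine {
    private final String glyphTable = "internal";
}

/** Hypothetical subclass that needs the private state of its parent. */
final class PatchedEngine extends Engine {
    /** Reads the private field declared in Engine via reflection. */
    String readGlyphTable() {
        try {
            Field field = Engine.class.getDeclaredField("glyphTable");
            field.setAccessible(true); // lifts the private access check for this Field object
            return (String) field.get(this);
        } catch (ReflectiveOperationException e) {
            // The field was renamed or removed: reflection fails at run time, not compile time
            throw new IllegalStateException("Engine internals changed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new PatchedEngine().readGlyphTable());
    }
}
```

The catch block reflects the main drawback of this route: the compiler cannot check the field name, so a library update can break the code silently until it runs.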

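The second file, "Bal_532935_0314.pdf"

The problem with the second file is also caused by the PDFBox code shown above. This time, however, the font is simple, so the other branch is executed: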

    if (font instanceof PDSimpleFont)
    {
        char c = (char) code;
        unicode = new String(new char[] { c });
    }

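The result is hardly better: with no Unicode mapping available, the character code is simply cast to a char, whatever its value. For the codes used in this file, that produces the unreadable output the OP describes.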
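To see why the bare cast in the simple-font branch tends to produce unreadable output, the cast can be tried in isolation. This is a standalone sketch, not PDFBox code: `CoercionDemo` is a made-up class, and the code values 3, 17 and 80 are arbitrary examples, not the codes actually used in the PDF.

```java
import java.util.Locale;

/** Standalone illustration; the code values used with it are hypothetical. */
final class CoercionDemo {
    /** Mirrors the cast in the simple-font branch: the code becomes one UTF-16 unit. */
    static String coerce(int code) {
        char c = (char) code;
        return new String(new char[] { c });
    }

    /** Reports the resulting code point and whether it is a control character. */
    static String describe(int code) {
        char c = coerce(code).charAt(0);
        String kind = Character.isISOControl(c) ? "control" : "printable";
        return String.format(Locale.ROOT, "code %d -> U+%04X (%s)", code, (int) c, kind);
    }

    public static void main(String[] args) {
        for (int code : new int[] { 3, 17, 80 }) {
            System.out.println(describe(code));
        }
    }
}
```

A font whose codes happen to coincide with ASCII survives the cast (80 becomes "P"), while small codes end up as control characters that most editors show as boxes or nothing at all.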

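If, on the other hand, PDFBox is changed as described above and the if branch is also changed to use the replacement character, as in the next snippet, you get the extraction output shown after it: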

    if (font instanceof PDSimpleFont)
    {
        // Use the Unicode replacement character to indicate an unknown character
        unicode = "\uFFFD";
    }

Aries Agro Care Private Limited
1118th Annual Report 2013-14
Balance Sheet as at 31st March, 2014
Particulars Note
No.
 As at 
31 March, 2014
Rupees
 As at
31 March, 2013
Rupees
I. EQUITY AND LIABILITIES
(1) Shareholder Funds
(a)               3  100,000  100,000
(b) Reserves and Surplus 4  (2,673,971)             
 (2,573,971)             
(2) Current Liabilities
(a) Short Term Borrowings 5  5,805,535            
(b) Trade Payables 6  159,400          
(c)                           7  2,500  22,743 
 5,967,435  5,934,756 
TOTAL  3,393,464            
II. ASSETS
(1) Non-Current Assets
(a)                         - -
 - -
(2) Current Assets
(a)                         9  39,605        
(b)                               10  3,353,859           
 3,393,464           
TOTAL  3,393,464           
                                
The Notes to Accounts 1 to 23 form part of these Financial Statements
As per our report of even date For and on behalf of the Board
For Kirti D. Shah & Associates 
                      
                             
Dr. Jimmy Mirchandani
Director
Kirti D. Shah 
Proprietor 
Membership No 32371
Dr. Rahul Mirchandani 
Director
Place : Mumbai. 
Date :- 26th May, 2014.
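
Both patches apply the same policy: an unmapped code becomes U+FFFD instead of being coerced or skipped. The difference from the stock behaviour can be sketched in a small standalone simulation. None of the names below (`UnmappedGlyphPolicy`, `stock`, `patched`, `countUnknown`) are PDFBox API; `mapped` stands in for the result of font.toUnicode, and the sample codes are invented.

```java
import java.util.Optional;

/** Standalone simulation of the mapping decision; not PDFBox code. */
final class UnmappedGlyphPolicy {
    static final String REPLACEMENT = "\uFFFD";

    /** Stock behaviour as quoted above: coerce simple-font codes, skip composite ones. */
    static Optional<String> stock(String mapped, boolean simpleFont, int code) {
        if (mapped != null) {
            return Optional.of(mapped);
        }
        if (simpleFont) {
            return Optional.of(String.valueOf((char) code));
        }
        return Optional.empty(); // glyph is dropped, the stripper never sees it
    }

    /** Patched behaviour: every unmapped code yields the replacement character. */
    static Optional<String> patched(String mapped) {
        return Optional.of(mapped != null ? mapped : REPLACEMENT);
    }

    /** Counts replacement characters, so the amount of unmappable text is measurable. */
    static long countUnknown(String text) {
        return text.chars().filter(ch -> ch == 0xFFFD).count();
    }

    public static void main(String[] args) {
        String[] mappings = { "P", null, null, "T" }; // two glyphs without a mapping
        StringBuilder stockText = new StringBuilder();
        StringBuilder patchedText = new StringBuilder();
        for (String mapped : mappings) {
            stock(mapped, false, 0).ifPresent(stockText::append);
            patched(mapped).ifPresent(patchedText::append);
        }
        System.out.println(stockText.length() + " vs " + patchedText.length()
                + " characters, unknown: " + countUnknown(patchedText.toString()));
    }
}
```

With the patched policy the number of emitted characters matches the number of glyphs again, which addresses the mismatch in character counts that the question describes.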

Source: https://habr.com/ru/post/1654546/

