Cannot read some PDF files using iTextSharp

I have a Win32 application that reads PDF files using iTextSharp and inserts an image into the document as a seal.

It works great with 99% of the files we process during the year, but lately some files just cannot be read. When I execute the code below:

    // Verbatim string, so that "\t" is not read as a tab escape
    string inputfile = @"C:\test.pdf";
    PdfReader reader = new PdfReader(inputfile);

This gives an exception:

    System.NullReferenceException occurred
      Message="Object reference not set to an instance of an object."
      Source="itextsharp"
      StackTrace:
        em iTextSharp.text.pdf.PdfReader.ReadPages()
        em iTextSharp.text.pdf.PdfReader.ReadPdf()
        em iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
        em iTextSharp.text.pdf.PdfReader..ctor(String filename)
        em MyApp.insertSeal() na C:\MyApp\Stamper.cs:linha 659

The PDF files that throw this exception open normally in Adobe Reader, and when I open one of them in Acrobat and save it, my application can read the saved copy.

Are these files corrupted, even though Adobe Reader can still open them?

I am sharing with you two sample files.

File that does NOT work: Not-Ok-Version.pdf

And the file that works after opening and saving it in Acrobat: OK-Version.pdf

+4
5 answers

Here's the source (java, sorry) for readPages:

    protected internal void ReadPages() {
        catalog = trailer.GetAsDict(PdfName.ROOT);
        rootPages = catalog.GetAsDict(PdfName.PAGES);
        pageRefs = new PageRefs(this);
    }

trailer, catalog, rootPages, and pageRefs are all PdfReader member variables.

If the trailer or the root/catalog object of the PDF is simply missing, your PDF is REALLY BROKEN. More likely, the xref table is a bit off and the objects in question are just not exactly where they are supposed to be (which is bad, but recoverable).

HOWEVER, when PdfReader first opens a PDF file, it parses ALL the objects in the file and converts them to the appropriate PdfObject-derived classes.

What it does not do is check that the object number declared in the xref table and the object number actually read from the file match. A mismatch is very unlikely, but possible. Bad software could write its PDF objects in the wrong order while keeping the byte offsets in the xref table correct. Software that overrides the object number from the xref table with the number found at that byte offset in the file would be fine.

iText would not be.
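The mismatch described above can be checked outside iText. Here is a minimal sketch in Python (chosen only because this is plain byte handling; it is not iText code). The name `check_xref_table` is made up for illustration, and the sketch assumes a classic, uncompressed xref table rather than a PDF 1.5+ cross-reference stream. It reads each in-use entry, jumps to the stored offset, and compares the object number found there with the number implied by the entry's position in the table.

```python
import re

def check_xref_table(data: bytes, xref_offset: int):
    """Return (object_number, offset, status) for each in-use entry of one classic xref table."""
    lines = data[xref_offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no classic xref table at this offset")
    results = []
    i = 1
    while i < len(lines):
        # Subsection header: "<first object number> <entry count>"
        header = re.fullmatch(rb"(\d+)\s+(\d+)", lines[i].strip())
        if header is None:
            break  # reached "trailer" (or something unexpected)
        first, count = int(header.group(1)), int(header.group(2))
        for number, line in enumerate(lines[i + 1:i + 1 + count], start=first):
            entry = re.match(rb"(\d{10}) (\d{5}) ([nf])", line)
            if entry is None:
                results.append((number, None, "unreadable entry"))
                continue
            if entry.group(3) == b"f":
                continue  # free entry, nothing to verify
            offset = int(entry.group(1))
            found = re.match(rb"(\d+)\s+(\d+)\s+obj", data[offset:offset + 32])
            if found is None:
                status = "no object at offset"
            elif int(found.group(1)) != number:
                status = "object %d found instead" % int(found.group(1))
            else:
                status = "ok"
            results.append((number, offset, status))
        i += 1 + count
    return results

# Hand-built test data: objects written in the "wrong" order, offsets themselves valid.
header = b"%PDF-1.6\n"
obj_a = b"2 0 obj\n<</Type/SomeOtherObj>>\nendobj\n"
obj_b = b"1 0 obj\n<</Type/SomeObject>>\nendobj\n"
body = header + obj_a + obj_b
off_a, off_b = len(header), len(header) + len(obj_a)
xref = b"xref\n0 3\n0000000000 65535 f \n%010d 00000 n \n%010d 00000 n \n" % (off_a, off_b)
data = body + xref + b"trailer\n<</Root 1 0 R /Size 3>>\nstartxref\n%d\n%%%%EOF\n" % len(body)

results = check_xref_table(data, len(body))
for row in results:
    print(row)
```

Every entry whose status is not "ok" is a candidate for the kind of damage discussed here; a table in which all offsets fail suggests the table belongs to a different file layout altogether.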

I still want to see the PDF.


Yep. This PDF is broken, all right. In particular:

The first 70 KB or so of the file define a fairly clean little PDF. Then changes were appended to it.

Scratch that. Someone tried to append changes to the PDF and failed. Badly. To understand how badly, let me explain some of the internal PDF syntax, illustrated with this example:

    %PDF-1.6
    1 0 obj
    <</Type/SomeObject ...>>
    endobj
    2 0 obj
    <</Type/SomeOtherObj /Ref 1 0 R>>
    endobj
    3 0 obj
    ...
    endobj
    <etc>
    xref
    0 10
    0000000000 65535 f
    0000000010 00000 n
    0000000049 00000 n
    0000000098 00000 n
    ...
    trailer
    <</Root 4 0 R /Size 10>>
    startxref
    124
    %%EOF

So we have a header/version line ("%PDF-1.x"), a list of objects (the ones shown here are dictionaries), a cross-reference (xref) table that lists the byte offset of every object in the file, a trailer giving the root object and the number of objects in the PDF, and finally startxref, the byte offset of the "x" in "xref".
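To make the layout concrete, here is a small Python sketch (not iText code; `find_startxref` is a name invented for this example) that reads the last startxref value in a file and checks whether that offset really lands on the keyword "xref". It assumes a classic xref table; files that use PDF 1.5+ cross-reference streams will fail the check even when they are healthy.

```python
import re

def find_startxref(pdf_bytes: bytes):
    """Return (offset, ok): the last startxref value and whether it points at an xref table."""
    # A reader starts from the end of the file, so the last startxref is the one that counts.
    matches = list(re.finditer(rb"startxref\s+(\d+)", pdf_bytes))
    if not matches:
        return None, False
    offset = int(matches[-1].group(1))
    # Classic tables begin with the keyword "xref" at exactly this offset.
    ok = pdf_bytes[offset:offset + 4] == b"xref"
    return offset, ok

# Tiny hand-built file in the shape of the example above (not a complete PDF).
body = b"%PDF-1.6\n1 0 obj\n<</Type/Catalog>>\nendobj\n"
xref_at = len(body)
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<</Root 1 0 R /Size 2>>\nstartxref\n"
    + str(xref_at).encode()
    + b"\n%%EOF\n"
)
print(find_startxref(sample))
```

Running it against a real file is just `find_startxref(open(path, "rb").read())`, with `path` being whatever file you want to inspect.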

You can append changes to an existing PDF. To do so, you simply add the new or changed objects after the existing %%EOF, followed by a cross-reference table for those new objects and a new trailer. The trailer of the appended change should include a /Prev key holding the byte offset of the previous cross-reference table.

In your NOT-OKAY PDF, someone tried to append changes to the PDF, AND FAILED BADLY.

The original PDF is still there, intact. That is what Reader shows you, and what you get when you save the PDF. I deleted everything after the first %%EOF in a hex editor, and the file was fine.

So, here is the layout of your NOT-OKAY pdf:

    %PDF1.4.1
    1 0 obj...
    2 through 7
    xref
    0 7
    <healthy xref>
    trailer
    <</Size 8 /Root 6 0 R /Info 7 0 R>>
    startxref
    68308
    %%EOF

So far so good. Here's where things get ugly:

    <binary garbage>
    endstream
    endobj
    xref
    0 7
    <horribly wrong xref>
    trailer
    <</ID [...] /Info 1 0 R /Root 2 0 R /Size 7>>
    startxref
    223022
    %%EOF

The only thing that is right about this section is the startxref value.

Problems:

  • The second trailer has no /Prev key.
  • All the byte offsets in the second xref table are wrong.
  • The section is the tail end of a stream object, but the beginning of that object IS MISSING. Streams should look something like this:

    1 0 obj
    <</Type/SomeType/Length 123>>
    stream
    123 bytes of data
    endstream
    endobj

The end of this file consists of some part (compressed, I suppose) of a stream, but without the dictionary at the beginning telling us which filters it uses and how long it is (not to mention any missing data), there is nothing you can do with it.

I suspect that someone tried to rebuild this PDF file completely, and then accidentally wrote the original 70 KB over the beginning of their version. Kaboom.

It looks like Adobe simply ignores the bad appended changes. iText could do that too, but so can you:

When iText fails to open a PDF file:
1. Search backwards through the file for the second-to-last %%EOF. Ignore the one at the very end; we want the previous state of the file.
2. Delete everything after that second-to-last %%EOF (if there is one) and try opening the file again.
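The two steps above can be sketched in a few lines of Python (again plain byte handling, not iText; `strip_last_revision` is a hypothetical helper name). One assumption to keep in mind: the sketch treats every literal %%EOF as a revision boundary, so a %%EOF that happens to sit inside binary stream data would fool it.

```python
def strip_last_revision(pdf_bytes: bytes):
    """Return the file up to and including the second-to-last %%EOF, or None if there is none."""
    marker = b"%%EOF"
    last = pdf_bytes.rfind(marker)
    if last == -1:
        return None  # not even one %%EOF: nothing to work with
    previous = pdf_bytes.rfind(marker, 0, last)
    if previous == -1:
        return None  # only one revision, so there is no earlier state to fall back to
    # Keep everything through the earlier marker and end the file with a newline.
    return pdf_bytes[:previous + len(marker)] + b"\n"

# Placeholder bytes shaped like the NOT-OKAY layout shown earlier.
original = b"%PDF-1.4\n...first revision...\nstartxref\n68308\n%%EOF\n"
broken = original + b"<binary garbage>\nendstream\nendobj\nxref\n...\nstartxref\n223022\n%%EOF\n"

repaired = strip_last_revision(broken)
print(repaired == original)
```

In practice you would write the returned bytes to a new file (never over the original) and hand that file to PdfReader again.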

The sad thing is that this broken PDF could have been something completely different from the "original" 70 KB, with some I/O error then writing that first part over the beginning of the file. Unlikely, but there is no way to be sure.

+8

Given that iTextSharp is now up to version 5.0, I would guess you are seeing a growing number of PDF files written to PDF specification versions that your version of iTextSharp does not support. It may be time to upgrade.

+3

When I pull out the source and run it against a bad PDF, there is an exception in ReadPdf() in the fourth try block when it calls ReadDocObj() :

 "Invalid object number. at file pointer 16" 

tokens.StringValue is "j".

@Mark Storer, you are an iText guy, so maybe this means something to you.

From a higher level, at least to my eyes, it looks like when RebuildXref() is called (which I assume happens when an invalid PDF is read), it rebuilds the trailer but not the catalog. The latter is what the NRE is complaining about. Again, this is just a guess.

+1

Maybe this will help someone... I had code that had worked for many years and then started to hang when reading bookmarks from a PDF file (the outlines variable below). It turned out that it broke when the project was upgraded from .NET 4.0 to .NET 4.5. As soon as I put it back on .NET 4.0, it worked again.

    RandomAccessFileOrArray raf = null;
    PdfReader reader1 = null;
    System.Collections.ArrayList outlines = null;
    raf = new iTextSharp.text.pdf.RandomAccessFileOrArray(sFile);
    reader1 = new iTextSharp.text.pdf.PdfReader(raf, null);
    outlines = iTextSharp.text.pdf.SimpleBookmark.GetBookmark(reader1);

Just for the record, the same VS web application project uses AjaxControlToolkit (from NuGet). Before rolling back, I also upgraded iTextSharp to version 5.5.5, and it still hung on the same line.

+1

Also make sure your HTML does not contain an hr tag when converting HTML to PDF:

 hdnEditorText.Value.Replace("\"", "'").Replace("<hr />", "").Replace("<hr/>", "") 
0

Source: https://habr.com/ru/post/1341348/

