Can libtiff be used to decode CCITT encoded data when the length is unknown?

Question

In the answers to this question: "C++ decode CCITT encoded images in PDF files"

it is suggested that libtiff can be used to decode CCITT encoded images. Of course, we must add a TIFF header to the CCITT stream to turn it into an actual TIFF file.

However, some images in PDF files are inline images, and their lengths are not specified, although their width, height and bit depth are. A PDF reader is expected to decode the CCITT stream, read (width * height * depth) bits of decoded data, and treat wherever it ends up after that data has been read as the end of the inline image. Then it should go on to the next page-marking operator, and so on.

This creates a problem. The TIFF image file directory has to state how many bytes each strip of image data contains, but we don't know how many bytes of encoded data really belong to the image until we decode it, and we can't decode the image without using libtiff...

Is there a way to use libtiff here, or do we need custom CCITT filter code?

+5

pdf tiff

Brian Oct 08 '16 at 1:00

1 answer

LSerni · Accepted Answer · 2016-10-14T14:06:35+0000

Strictly speaking (is it possible to use libtiff...?), yes. It takes some hacks, but not too many.

Fact: the data will consist of a single strip, since there is no offset information, so our only offset is zero. We just need to read that strip.

Fact: this data is the compressed form of a W * H matrix of pixels with a depth of 1 bit.

Step 1: estimate the maximum possible length of the compressed stream. This is about 15% of W * H, i.e. with W = 1000 and H = 1000 you get 150,000 bytes. This value will always be greater than the actual value. If we have a better estimate, thanks to locating the correct EI tag that ends the image, so much the better, but it is not necessary.

Step 2: build a "virtual" TIFF file. It consists of a header of the form 49 49 2a 00 AA BB CC DD, where 0xDDCCBBAA is the estimated length plus 8, followed by our estimated data stream, followed by the TIFF directory.

Step 3: the TIFF directory always has the same structure; some of the values in it are offsets, which depend trivially on the IFD position 0xDDCCBBAA. Quoting from the TIFF6 specs (note that the byte order there is reversed: Motorola, not Intel, endianness):

    TIFF 6.0 Specification, Final, June 3, 1992 (page 20)

    Putting it all together (along with a couple of less-important fields
    that are discussed later), a sample bilevel image file might contain
    the following fields

    A Sample Bilevel TIFF File

    Offset  Description                   Value
    (hex)                                 (numeric values are expressed in hexadecimal notation)

    Header:
    0000    Byte Order                    4D4D
    0002    42                            002A
    0004    1st IFD offset                00000014

    IFD:
    0014    Number of Directory Entries   000C
    0016    NewSubfileType                00FE 0004 00000001 00000000
    0022    ImageWidth                    0100 0004 00000001 000007D0
    002E    ImageLength                   0101 0004 00000001 00000BB8
    003A    Compression                   0103 0003 00000001 8005 0000
    0046    PhotometricInterpretation     0106 0003 00000001 0001 0000
    0052    StripOffsets                  0111 0004 000000BC 000000B6(*1)
    005E    RowsPerStrip                  0116 0004 00000001 00000010
    006A    StripByteCounts               0117 0003 000000BC 000003A6(*2)
    0076    XResolution                   011A 0005 00000001 00000696(*3)
    0082    YResolution                   011B 0005 00000001 0000069E(*4)
    008E    Software                      0131 0002 0000000E 000006A6(*5)
    009A    DateTime                      0132 0002 00000014 000006B6(*6)
    00A6    Next IFD offset               00000000

    Values longer than 4 bytes:
    (*1)    StripOffsets                  Offset0 00000008
    (*2)    StripByteCounts               Count0
    (*3)    XResolution                   0000012C 00000001
    (*4)    YResolution                   0000012C 00000001
    (*5)    Software                      "PageMaker 4.0"
    (*6)    DateTime                      "1988:02:18 13:59:59"

In the above example, 0xDDCCBBAA is actually 0x00000014, and all the other offsets follow from it.
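Steps 1 to 3 can be sketched in plain C roughly as follows. This is a minimal illustration, not the code used for the tests described here: the function names, the choice of nine tags, Compression = 4 (CCITT Group 4) and PhotometricInterpretation = 0 (WhiteIsZero) are all assumptions, and the sketch writes every value inline in little-endian ("II") order, unlike the big-endian sample quoted from the spec. It also rounds the data length up to an even number so that the IFD starts on a word boundary.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Step 1: rule-of-thumb upper bound, about 15% of W * H, rounded up.
 * Classic TIFF offsets are 32-bit, so the result is kept in 32 bits. */
uint32_t estimate_ccitt_length(uint32_t width, uint32_t height)
{
    uint64_t pixels = (uint64_t)width * height;   /* avoid 32-bit overflow */
    return (uint32_t)((pixels * 15 + 99) / 100);
}

/* Little-endian writers, matching the "49 49" byte-order mark. */
static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* One 12-byte IFD entry holding a single SHORT (type 3) or LONG (type 4). */
static uint8_t *put_entry(uint8_t *p, uint16_t tag, uint16_t type, uint32_t value)
{
    put16(p, tag);
    put16(p + 2, type);
    put32(p + 4, 1);                  /* count = 1 */
    put32(p + 8, 0);                  /* clear the value field first */
    if (type == 3)
        put16(p + 8, (uint16_t)value);
    else
        put32(p + 8, value);
    return p + 12;
}

enum { N_ENTRIES = 9, IFD_SIZE = 2 + N_ENTRIES * 12 + 4 };

/* Steps 2 and 3: header, then est_len bytes of (over-long) strip data,
 * then the IFD. Returns a malloc'ed buffer the caller must free. */
uint8_t *build_virtual_tiff(const uint8_t *src, size_t src_len, uint32_t est_len,
                            uint32_t width, uint32_t height, size_t *out_len)
{
    uint32_t data_len = est_len + (est_len & 1u);  /* keep the IFD word-aligned */
    uint32_t ifd_off = 8 + data_len;               /* this is 0xDDCCBBAA */
    size_t total = (size_t)ifd_off + IFD_SIZE;
    size_t n = src_len < data_len ? src_len : data_len;
    uint8_t *buf = calloc(total, 1);               /* unused bytes stay zero */
    if (!buf)
        return NULL;

    buf[0] = 0x49; buf[1] = 0x49; buf[2] = 0x2A; buf[3] = 0x00;
    put32(buf + 4, ifd_off);
    if (n)
        memcpy(buf + 8, src, n);

    uint8_t *p = buf + ifd_off;
    put16(p, N_ENTRIES);
    p += 2;
    p = put_entry(p, 256, 4, width);     /* ImageWidth */
    p = put_entry(p, 257, 4, height);    /* ImageLength */
    p = put_entry(p, 258, 3, 1);         /* BitsPerSample */
    p = put_entry(p, 259, 3, 4);         /* Compression: Group 4 (assumed) */
    p = put_entry(p, 262, 3, 0);         /* Photometric: WhiteIsZero (assumed) */
    p = put_entry(p, 273, 4, 8);         /* StripOffsets: right after the header */
    p = put_entry(p, 277, 3, 1);         /* SamplesPerPixel */
    p = put_entry(p, 278, 4, height);    /* RowsPerStrip: one strip for all rows */
    p = put_entry(p, 279, 4, data_len);  /* StripByteCounts: the estimate */
    put32(p, 0);                         /* next IFD offset: none */

    *out_len = total;
    return buf;
}
```

With W = H = 1000 the estimate is 150,000 bytes, so the IFD lands at offset 150,008 and the whole buffer is 150,122 bytes long. A Group 3 stream would need a different Compression value and probably extra tags, which this sketch does not attempt.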

I ran several tests using a single-strip TIFF G4 image, which I created with ImageMagick and then tiffcp'ed into a 1-strip CCITT format. The header there is slightly different (I don't see the Software and DateTime tags, which according to the specification should be there). Otherwise, it checks out.

Now we have a corrupt TIFF image with a single overlong strip, and it is in memory.

Using TIFFClientOpen, we can access it as if it were a disk image.

Attempting to read the first page will result in an error and abort the program:

 TIFFFillStrip: Read error on strip 0; got 143151 bytes, expected 762826.

Using TIFFSetErrorHandler and TIFFSetErrorHandlerExt, we can set ourselves up to catch this error and parse it, thereby recovering the 143151 figure instead of aborting.
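A handler of that kind might look like the sketch below. It takes the same parameters as libtiff's TIFFErrorHandler type (module name, format string, va_list), but it is written without including tiffio.h so that it stands alone. The global variables, the function names and the simulate_libtiff_error helper are hypothetical, and the sscanf pattern assumes the message wording shown above, which may differ between libtiff versions.

```c
#include <stdarg.h>
#include <stdio.h>

/* Filled in by the handler when the expected message is seen. */
static unsigned long long g_bytes_got = 0;
static int g_have_bytes = 0;

/* Same parameter list as libtiff's TIFFErrorHandler. */
void capture_read_error(const char *module, const char *fmt, va_list ap)
{
    char msg[512];
    unsigned long strip;
    unsigned long long got, expected;

    (void)module;                        /* e.g. "TIFFFillStrip"; not needed here */
    vsnprintf(msg, sizeof msg, fmt, ap); /* expand the message libtiff would print */

    /* Pattern assumed from the error text quoted above. */
    if (sscanf(msg, "Read error on strip %lu; got %llu bytes, expected %llu",
               &strip, &got, &expected) == 3) {
        g_bytes_got = got;
        g_have_bytes = 1;
    }
}

/* Hypothetical stand-in for libtiff invoking the handler, for trying it out. */
void simulate_libtiff_error(const char *module, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    capture_read_error(module, fmt, ap);
    va_end(ap);
}
```

In a real program the handler would be registered once with TIFFSetErrorHandler(capture_read_error) before opening the image, and g_bytes_got would be inspected after the failed read.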

We need to pass callbacks to TIFFClientOpen, but they are all very lightweight:

    TIFFReadWriteProc readproc(h, *ptr, n) // copy n bytes from FakeBuffer+pos into ptr, update pos to pos + n, ignore h.
    TIFFReadWriteProc writeproc            // Throw an error. We don't write
    TIFFSeekProc      seekproc             // update pos appropriately
    TIFFCloseProc     closeproc            // do nothing
    TIFFSizeProc      sizeproc             // return total buffer size
    TIFFMapFileProc   mapproc              // Set to NULL
    TIFFUnmapFileProc unmapproc            // Set to NULL
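The callback bodies listed above could be sketched like this. To keep the example compilable without libtiff, it uses void *, ptrdiff_t and uint64_t as stand-ins for thandle_t, tmsize_t and toff_t (an assumption about libtiff 4.x on typical platforms); when building against tiffio.h, the real typedefs should be used so the function pointer types match. Unlike the outline, which ignores h and reads from a global FakeBuffer, this version passes the buffer through the handle. The MemStream type and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>    /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* In-memory "file": the virtual TIFF plus a read position. */
typedef struct {
    const uint8_t *data;
    uint64_t size;
    uint64_t pos;
} MemStream;

/* readproc: copy up to n bytes from the current position, then advance. */
ptrdiff_t mem_read(void *h, void *buf, ptrdiff_t n)
{
    MemStream *m = h;
    if (n <= 0 || m->pos >= m->size)
        return 0;
    uint64_t avail = m->size - m->pos;
    if ((uint64_t)n > avail)
        n = (ptrdiff_t)avail;            /* short read at end of buffer */
    memcpy(buf, m->data + m->pos, (size_t)n);
    m->pos += (uint64_t)n;
    return n;
}

/* writeproc: the image is read-only, so always report failure. */
ptrdiff_t mem_write(void *h, void *buf, ptrdiff_t n)
{
    (void)h; (void)buf; (void)n;
    return -1;
}

/* seekproc: update the position and return it. Negative offsets arrive
 * as large unsigned values, and unsigned wraparound makes the sums work. */
uint64_t mem_seek(void *h, uint64_t off, int whence)
{
    MemStream *m = h;
    switch (whence) {
    case SEEK_SET: m->pos = off; break;
    case SEEK_CUR: m->pos += off; break;
    case SEEK_END: m->pos = m->size + off; break;
    default: return (uint64_t)-1;
    }
    return m->pos;
}

/* closeproc: nothing to release, the caller owns the buffer. */
int mem_close(void *h)
{
    (void)h;
    return 0;
}

/* sizeproc: total size of the virtual file. */
uint64_t mem_size(void *h)
{
    return ((MemStream *)h)->size;
}
```

With the libtiff typedefs in place, the open call would then be along the lines of TIFFClientOpen("virtual", "r", (thandle_t)&stream, mem_read, mem_write, mem_seek, mem_close, mem_size, NULL, NULL), where "virtual" is only a label used in messages.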

The processing is admittedly awkward and convoluted, but as far as feasibility is concerned, it can be done.

I ran the tests in C, manually extracting the CCITT stream from an inline BI/ID/EI PDF image that I found on the Internet, and read it as described above.

If I had a reliable way to determine the correct EI, I could always work out the correct offset and directly create a correct, readable TIFF file from the PDF, without involving libtiff at all. (I dug up a post by Tilman Hausherr explaining a hack that does this by recognizing valid PDF operators following the EI, which makes me think that there probably aren't many better methods.)
