Why is my image distorted when decoding as FlateDecode using iTextSharp?

When decoding a PDF image as FlateDecode via iTextSharp, the image is distorted and I cannot understand why.

The bits-per-component value is 1, so the pixel format is Format1bppIndexed. If I change the PixelFormat value to Format4bppIndexed, the image is recognizable to some extent (shrunk, with the coloring off, but readable) and is repeated 4 times horizontally. If I set the pixel format to Format8bppIndexed, it is also recognizable to some extent and is repeated 8 times horizontally.

Below is the image produced with the Format1bppIndexed pixel format. Unfortunately, I cannot show the others due to security restrictions.

distorted image

Below is the code. It is essentially the only solution I have come across, and copies of it are scattered around SO and the rest of the Internet.

    int xrefIdx = ((PRIndirectReference)obj).Number;
    PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
    PdfStream str = (PdfStream)(pdfObj);
    byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);

    string filter = ((PdfArray)tg.Get(PdfName.FILTER))[0].ToString();
    string width = tg.Get(PdfName.WIDTH).ToString();
    string height = tg.Get(PdfName.HEIGHT).ToString();
    string bpp = tg.Get(PdfName.BITSPERCOMPONENT).ToString();

    if (filter == "/FlateDecode")
    {
        bytes = PdfReader.FlateDecode(bytes, true);

        System.Drawing.Imaging.PixelFormat pixelFormat;
        switch (int.Parse(bpp))
        {
            case 1:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;
                break;
            case 24:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bpp);
        }

        var bmp = new System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat);
        System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(
            new System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)),
            System.Drawing.Imaging.ImageLockMode.WriteOnly,
            pixelFormat);
        Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
        bmp.UnlockBits(bmd);
        bmp.Save(@"C:\temp\my_flate_picture-" + DateTime.Now.Ticks.ToString() + ".png", ImageFormat.Png);
    }

What do I need to do so that my image extraction works as desired when working with FlateDecode?

NOTE: I do not want to use another library to extract images. I am looking for a solution using ONLY iTextSharp and the .NET Framework. If a solution exists in Java (iText) and is easily ported to .NET, that would also be sufficient.

UPDATE: The ImageMask property is set to true, which implies there is no color space and the image is therefore implicitly black and white. With a bpp of 1, the PixelFormat should be Format1bppIndexed, which, as mentioned earlier, produces the image shown above.

UPDATE: To check the image size, I extracted the image using Acrobat X Pro, and for this particular example its size was reported as 2403x3005. When retrieved via iTextSharp, the size was reported as 2544x3300. I changed the size in the debugger to mirror 2403x3005; however, the call to Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length); then throws an exception:

Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

My guess is that this is caused by the changed size, which no longer matches the byte data being copied.
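One rough way to check this guess is to compare the number of bytes being copied with the size of the buffer that LockBits exposes. The sketch below assumes that GDI+ pads each row of a bitmap it allocates to a multiple of 4 bytes; BitmapCapacity and its methods are hypothetical names, not part of iTextSharp or System.Drawing.

```csharp
using System;

public static class BitmapCapacity
{
    // Bytes per row in a bitmap allocated by GDI+ (assumption: rows padded to 4 bytes).
    public static int Stride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Total size of the pixel buffer for the given dimensions.
    public static long BufferSize(int width, int height, int bitsPerPixel)
    {
        return (long)Stride(width, bitsPerPixel) * height;
    }

    // True when dataLength bytes can be copied into the buffer without overrunning it.
    public static bool Fits(long dataLength, int width, int height, int bitsPerPixel)
    {
        return dataLength <= BufferSize(width, height, bitsPerPixel);
    }
}
```

Under that assumption, a 2403x3005 bitmap at 1 bpp has a buffer of 304 * 3005 = 913520 bytes, while a 2544x3300 image at 1 bpp decodes to about 2544 * 3300 / 8 = 1049400 bytes, so the copy would run past the end of the smaller buffer. At the original 2544x3300 the buffer is 320 * 3300 = 1056000 bytes, which would explain why that copy does not fault.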

UPDATE: As recommended by Jimmy, I checked whether calling PdfReader.GetStreamBytes returns a byte[] whose length equals width * height / 8, since GetStreamBytes should call FlateDecode. Both the manual call to FlateDecode and the call to PdfReader.GetStreamBytes produced a byte[] of length 1049401, while width * height / 8 is 2544 * 3300 / 8, or 1049400, so there is a difference of 1. I am not sure whether this off-by-one is the root cause, or how to determine whether it is.
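The comparison described in this update can be written as a short sketch. It relies on the PDF rule that every row of sample data starts on a byte boundary, and it takes the number of color components as a parameter (1 for an image mask). ImageSizeCheck and its methods are made-up names for illustration.

```csharp
using System;

public static class ImageSizeCheck
{
    // Bytes in one row of PDF sample data; each row is padded up to a whole byte.
    public static int PackedRowBytes(int width, int bitsPerComponent, int components)
    {
        return (width * bitsPerComponent * components + 7) / 8;
    }

    // Total bytes expected once all stream filters have been undone.
    public static long ExpectedDecodedLength(int width, int height, int bitsPerComponent, int components)
    {
        return (long)PackedRowBytes(width, bitsPerComponent, components) * height;
    }
}
```

For width 2544, height 3300, 1 bit per component and 1 component, each row takes 318 bytes and the expected length is 1049400, one byte less than the 1049401 bytes observed. Rounding per row matters when the width is not a multiple of 8: a 2403-pixel row needs 301 bytes, not 2403 / 8 = 300.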

UPDATE: When trying the approach suggested by kuujinbo, I get an IndexOutOfRangeException when calling renderInfo.GetImage(); inside RenderImage. The fact that width * height / 8, as mentioned above, is off by 1 compared with the length of the byte[] returned by FlateDecode makes me think the two problems are one and the same; however, the solution still eludes me.

   at System.util.zlib.Adler32.adler32(Int64 adler, Byte[] buf, Int32 index, Int32 len)
   at System.util.zlib.ZStream.read_buf(Byte[] buf, Int32 start, Int32 size)
   at System.util.zlib.Deflate.fill_window()
   at System.util.zlib.Deflate.deflate_slow(Int32 flush)
   at System.util.zlib.Deflate.deflate(ZStream strm, Int32 flush)
   at System.util.zlib.ZStream.deflate(Int32 flush)
   at System.util.zlib.ZDeflaterOutputStream.Write(Byte[] b, Int32 off, Int32 len)
   at iTextSharp.text.pdf.codec.PngWriter.WriteData(Byte[] data, Int32 stride)
   at iTextSharp.text.pdf.parser.PdfImageObject.DecodeImageBytes()
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PdfDictionary dictionary, Byte[] samples)
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PRStream stream)
   at iTextSharp.text.pdf.parser.ImageRenderInfo.PrepareImageObject()
   at iTextSharp.text.pdf.parser.ImageRenderInfo.GetImage()
   at cyos.infrastructure.Core.MyImageRenderListener.RenderImage(ImageRenderInfo renderInfo)

UPDATE: Trying the various methods listed here (my original solution as well as the solution offered by kuujinbo) on another page of the PDF does produce images; however, problems always arise when the filter type is /FlateDecode, and no image is produced in that case.

+4
2 answers

Try copying the data row by row; that may solve the problem.

    using System;
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.Runtime.InteropServices;
    using iTextSharp.text.pdf;

    // imgObj (the image's stream object) and fileName (the output path)
    // come from the surrounding code and are not defined here.
    int w = imgObj.GetAsNumber(PdfName.WIDTH).IntValue;
    int h = imgObj.GetAsNumber(PdfName.HEIGHT).IntValue;
    int bpp = imgObj.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;
    var pixelFormat = PixelFormat.Format1bppIndexed;

    byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObj);
    byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
    byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, imgObj.GetAsDict(PdfName.DECODEPARMS));
    // byte[] streamBytes = PdfReader.GetStreamBytes((PRStream)imgObj); // same result as above 3 lines of code.

    using (Bitmap bmp = new Bitmap(w, h, pixelFormat))
    {
        var bmpData = bmp.LockBits(new Rectangle(0, 0, w, h), ImageLockMode.WriteOnly, pixelFormat);

        // Bytes in one row of decoded PDF samples (padded to a whole byte).
        int length = (int)Math.Ceiling(w * bpp / 8.0);
        for (int i = 0; i < h; i++)
        {
            int offset = i * length;              // start of row i in the decoded stream
            int scanOffset = i * bmpData.Stride;  // start of row i in the bitmap buffer
            // IntPtr.Add avoids the overflow that Scan0.ToInt32() can hit in a 64-bit process.
            Marshal.Copy(streamBytes, offset, IntPtr.Add(bmpData.Scan0, scanOffset), length);
        }

        bmp.UnlockBits(bmpData);
        bmp.Save(fileName);
    }
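A likely reason the row-by-row copy helps: a bitmap that GDI+ allocates pads every row to a multiple of 4 bytes (its Stride), whereas decoded PDF samples pad each row only to a whole byte. For a 2544-pixel-wide 1-bpp image that is 318 bytes per row in the stream against a 320-byte stride, so one big Marshal.Copy shifts each row a little further than the row before it. The sketch below does the same repacking on plain byte arrays, without iTextSharp or System.Drawing. RowRepacker and its methods are hypothetical names, and the stride formula is an assumption that should hold for bitmaps created with new Bitmap(w, h, format).

```csharp
using System;

public static class RowRepacker
{
    // Stride of a bitmap allocated by GDI+ (assumption: rows padded to 4 bytes).
    public static int GdiStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Copies byte-padded rows into a buffer whose rows are stride-aligned.
    public static byte[] ToStrideAligned(byte[] packed, int width, int height, int bitsPerPixel)
    {
        int rowBytes = (width * bitsPerPixel + 7) / 8;   // one row of PDF samples
        int stride = GdiStride(width, bitsPerPixel);     // one row of the bitmap

        if (packed.Length < (long)rowBytes * height)
            throw new ArgumentException("Not enough sample data for the stated size.", "packed");

        byte[] aligned = new byte[stride * height];
        for (int y = 0; y < height; y++)
        {
            // Padding bytes at the end of each destination row stay zero.
            Buffer.BlockCopy(packed, y * rowBytes, aligned, y * stride, rowBytes);
        }
        return aligned;
    }
}
```

Trailing bytes beyond rowBytes * height, such as the one extra byte seen in the question, are simply ignored. The aligned buffer could then be written with a single Marshal.Copy, provided bmpData.Stride equals GdiStride(w, bpp); checking that at run time is cheap and avoids relying on the assumption.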
+8

If you can use the latest version (5.1.3), the API for extracting FlateDecode and other types of images has been simplified using the iTextSharp.text.pdf.parser namespace. Basically you use PdfReaderContentParser to help you parse a PDF document, then you implement an IRenderListener (in this case) for image processing. Here's a working example of an HTTP handler:

    <%@ WebHandler Language="C#" Class="bmpExtract" %>
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Web;
    using iTextSharp.text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    public class bmpExtract : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            HttpServerUtility Server = context.Server;
            HttpResponse Response = context.Response;
            PdfReader reader = new PdfReader(Server.MapPath("./bmp.pdf"));
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            MyImageRenderListener listener = new MyImageRenderListener();

            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                parser.ProcessContent(i, listener);
            }

            for (int i = 0; i < listener.Images.Count; ++i)
            {
                string path = Server.MapPath("./" + listener.ImageNames[i]);
                using (FileStream fs = new FileStream(path, FileMode.Create, FileAccess.Write))
                {
                    fs.Write(listener.Images[i], 0, listener.Images[i].Length);
                }
            }
        }

        public bool IsReusable
        {
            get { return false; }
        }

        public class MyImageRenderListener : IRenderListener
        {
            public void RenderText(TextRenderInfo renderInfo) { }
            public void BeginTextBlock() { }
            public void EndTextBlock() { }

            public List<byte[]> Images = new List<byte[]>();
            public List<string> ImageNames = new List<string>();

            public void RenderImage(ImageRenderInfo renderInfo)
            {
                PdfImageObject image = null;
                try
                {
                    image = renderInfo.GetImage();
                    if (image == null) return;

                    ImageNames.Add(string.Format(
                        "Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
                    ));
                    using (MemoryStream ms = new MemoryStream(image.GetImageAsBytes()))
                    {
                        Images.Add(ms.ToArray());
                    }
                }
                catch (IOException ie)
                {
                    /*
                     * pass-through; image type not supported by iText[Sharp]; eg jbig2
                     */
                }
            }
        }
    }

The iText[Sharp] development team is still working on the implementation, so I can't say for sure whether this will work in your case. But it does work on this simple PDF example (used above) and on several other PDF files with bitmap images that I tried.

EDIT: I am still experimenting with the new API myself and made a mistake in the code example above. The PdfImageObject should have been initialized to null outside the try..catch block. The correction has been made above.

Also, when I use the above code on an unsupported image type (like jbig2), I get a different exception - "XX color depth is not supported", where "XX" is a number. And iTextSharp has supported FlateDecode in all the examples I tried (although I know that does not help in your case).

Was the PDF created by third-party (non-Adobe) software? From what I read in the book, some third-party vendors produce PDF files that do not fully comply with the specification, and iText[Sharp] cannot handle some of these files, while Adobe products can. IIRC, I have seen cases on the iText mailing list where PDF files created by Crystal Reports caused problems; there is one thread here.

Is there any way you could generate a test PDF, using the software you use, that contains some non-sensitive FlateDecode images? Then maybe someone here can help a little better.

+1

Source: https://habr.com/ru/post/1386157/

