How to extract attachments from a PDF file?

I have a large number of PDF documents with xml files attached to them. I would like to extract these attached xml files and read them. How to do this programmatically using .net?

+6
5 answers

iTextSharp is also quite capable of extracting attachments, although you may have to use its low-level objects to do so.

There are two ways to embed files in a PDF:

  • In a file attachment annotation
  • In the document-level "EmbeddedFiles" name tree

Once you have a file specification dictionary from either source, the file itself is a stream inside the dictionary stored under that specification's "EF" (embedded file) key.

So, to list all the files at the document level, you can write code like this (in Java):

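    import java.util.HashMap;
    import java.util.Map;
    import com.itextpdf.text.pdf.*; // iText 5.x; older releases use com.lowagie.text.pdf

    Map<String, byte[]> files = new HashMap<String, byte[]>();
    PdfReader reader = new PdfReader(pdfPath);
    PdfDictionary root = reader.getCatalog();
    PdfDictionary names = root.getAsDict(PdfName.NAMES); // may be null
    // EmbeddedFiles is a name tree (a dictionary); its key/value pairs sit in a Names array.
    PdfDictionary tree = names == null ? null : names.getAsDict(PdfName.EMBEDDEDFILES); // may be null
    // Large name trees may use Kids instead of a flat Names array; that case is not handled here.
    PdfArray embeddedFiles = tree == null ? null : tree.getAsArray(PdfName.NAMES); // may be null
    int len = embeddedFiles == null ? 0 : embeddedFiles.size();
    for (int i = 0; i + 1 < len; i += 2) {
        PdfString name = embeddedFiles.getAsString(i);           // key of the pair
        PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1); // value of the pair
        PdfDictionary ef = fileSpec == null ? null : fileSpec.getAsDict(PdfName.EF);
        if (name == null || ef == null) {
            continue;
        }
        // Prefer the Unicode entry (UF) and fall back to F.
        PdfStream stream = ef.contains(PdfName.UF) ? ef.getAsStream(PdfName.UF) : ef.getAsStream(PdfName.F);
        if (stream != null) {
            files.put(name.toUnicodeString(), PdfReader.getStreamBytes((PRStream) stream));
        }
    }
    reader.close();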
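Since the question also asks how to read the attached XML, here is a minimal sketch of that last step, assuming you already hold a `Map<String, byte[]>` like the `files` map built above. It uses only the JDK's built-in parser. The class name `AttachmentXml`, the helper `parseXmlAttachments`, and the choice to filter on a `.xml` suffix are illustrative assumptions, not part of iText.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class AttachmentXml {

    // Hypothetical helper: parses every entry whose name ends in ".xml" and skips the rest.
    static Map<String, Document> parseXmlAttachments(Map<String, byte[]> files) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Attachments come from arbitrary PDFs, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        Map<String, Document> parsed = new HashMap<>();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            if (entry.getKey().toLowerCase(Locale.ROOT).endsWith(".xml")) {
                parsed.put(entry.getKey(), builder.parse(new ByteArrayInputStream(entry.getValue())));
            }
        }
        return parsed;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data standing in for attachments extracted from a PDF.
        Map<String, byte[]> files = new HashMap<>();
        files.put("invoice.xml", "<invoice><total>42</total></invoice>".getBytes(StandardCharsets.UTF_8));
        files.put("logo.png", new byte[] {1, 2, 3});

        for (Map.Entry<String, Document> e : parseXmlAttachments(files).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().getDocumentElement().getNodeName());
        }
    }
}
```

Parsing straight from the byte array avoids writing temporary files; if the attachments are large, streaming them to disk first may be the better trade-off.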
+6

This is an old question, but my alternative solution using PDF Clown may be of interest: it is much cleaner than the previously suggested snippets, and more complete, since it iterates at both the document level and the page level:

    using org.pdfclown.bytes;
    using org.pdfclown.documents;
    using org.pdfclown.documents.files;
    using org.pdfclown.documents.interaction.annotations;
    using org.pdfclown.objects;
    using System;
    using System.Collections.Generic;

    // Returns the attachments keyed by path (the original snippet built the dictionary but never returned it).
    Dictionary<string, byte[]> ExtractAttachments(string pdfPath)
    {
        Dictionary<string, byte[]> attachments = new Dictionary<string, byte[]>();
        using (org.pdfclown.files.File file = new org.pdfclown.files.File(pdfPath))
        {
            Document document = file.Document;

            // 1. Embedded files (document level).
            foreach (KeyValuePair<PdfString, FileSpecification> entry in document.Names.EmbeddedFiles)
            {
                EvaluateDataFile(attachments, entry.Value);
            }

            // 2. File attachments (page level).
            foreach (Page page in document.Pages)
            {
                foreach (Annotation annotation in page.Annotations)
                {
                    if (annotation is FileAttachment)
                    {
                        EvaluateDataFile(attachments, ((FileAttachment)annotation).DataFile);
                    }
                }
            }
        }
        return attachments;
    }

    // Stores the embedded bytes of a file specification, if it has any.
    void EvaluateDataFile(Dictionary<string, byte[]> attachments, FileSpecification dataFile)
    {
        if (dataFile is FullFileSpecification)
        {
            EmbeddedFile embeddedFile = ((FullFileSpecification)dataFile).EmbeddedFile;
            if (embeddedFile != null)
            {
                attachments[dataFile.Path] = embeddedFile.Data.ToByteArray();
            }
        }
    }

Note that you don't have to worry about null-pointer exceptions, as PDF Clown provides the abstraction and automation needed to traverse the model smoothly.

PDF Clown is an LGPL 3 library implemented on both the Java and .NET platforms (I am its lead developer). If you want to try it, I suggest you check out its SVN repository at sourceforge.net, as it continues to evolve.

+3

Take a look at the ABCpdf library; in my opinion it is very easy to use and fast.

+2

You can try Aspose.Pdf.Kit for .NET. The PdfExtractor class lets you extract attachments using two methods, ExtractAttachment and GetAttachment. See the attachment extraction example.

Disclosure: I work as a developer evangelist at Aspose.

+1

What I got working is slightly different from anything else I have seen online.

So, just in case, I thought I'd post it here to help someone else. I had to go through many iterations to figure out, the hard way, what I needed to get it to work.

I am merging two PDFs into a third PDF, where one of the first two PDFs may have file attachments that need to be carried over into the third PDF. I work entirely in streams, using ASP.NET, C# 4.0 and iTextSharp 5.1.2.0.

    // Extract files from the submitted PDF.
    // writeReader is the PdfReader for a PDF input stream.
    Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();
    PdfDictionary names = writeReader.Catalog.GetAsDict(PdfName.NAMES); // may be null
    if (names != null)
    {
        PdfDictionary embeddedFiles = names.GetAsDict(PdfName.EMBEDDEDFILES); // may be null
        if (embeddedFiles != null)
        {
            PdfArray fileSpecs = embeddedFiles.GetAsArray(PdfName.NAMES); // may be null
            if (fileSpecs != null)
            {
                // Entries come in (name, file specification) pairs; only the odd indexes (1, 3, 5...) are needed.
                for (int i = 1; i < fileSpecs.Size; i += 2)
                {
                    PdfDictionary fileSpec = fileSpecs.GetAsDict(i); // may be null
                    if (fileSpec == null) continue;
                    PdfDictionary refs = fileSpec.GetAsDict(PdfName.EF); // may be null
                    if (refs == null) continue;
                    foreach (PdfName key in refs.Keys)
                    {
                        PRStream stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));
                        PdfString fileName = fileSpec.GetAsString(key); // may be null
                        if (stream != null && fileName != null)
                        {
                            // The indexer avoids the exception Add throws when two keys (e.g. F and UF) carry the same name.
                            files[fileName.ToString()] = PdfReader.GetStreamBytes(stream);
                        }
                    }
                }
            }
        }
    }
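The "pairs" remark in the code above refers to how the Names array is laid out: keys and values alternate, so entry 0 is a name, entry 1 is its file specification, entry 2 the next name, and so on. The following sketch shows that walk in isolation, in Java to match the first answer. A plain `List<Object>` stands in for the PDF array, and the class name `NamePairs` and method `toMap` are hypothetical; no PDF library is involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NamePairs {

    // Hypothetical stand-in: turns a flat [key, value, key, value, ...] list into a map.
    static Map<String, Object> toMap(List<Object> names) {
        Map<String, Object> result = new LinkedHashMap<>();
        // Step by two; the i + 1 bound skips a dangling key in a malformed array.
        for (int i = 0; i + 1 < names.size(); i += 2) {
            result.put(String.valueOf(names.get(i)), names.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up data: strings stand in for file specification dictionaries.
        List<Object> names = Arrays.asList("a.xml", "specA", "b.xml", "specB");
        System.out.println(toMap(names).keySet());
    }
}
```

Bounding the loop on `i + 1` rather than `i` is a small defensive choice: an odd-length array from a damaged file then yields the complete pairs instead of an index error.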
+1

Source: https://habr.com/ru/post/890222/

