PDF file structure?

For a small project, I have to parse PDF files and extract a certain part of them (a simple string of characters). I would like to use Python for this, and I have found several libraries that can do what I want in some way.

But now, after some research, I wonder what the actual structure of a PDF file is. Does anyone know whether there is a specification or an explanation somewhere on the Internet? I found a link to Adobe, but it seems to be dead :(

+63
pdf
Sep 17 '08 at 23:11
12 answers

Here is a link to Adobe's reference material:

http://www.adobe.com/devnet/pdf/pdf_reference.html

You should be aware that PDF is designed for presentation, not structure. Parsing it will not be easy.

+41
Sep 17 '08 at 23:13

I found the GNU Introduction to PDF helpful for understanding the structure. It includes an easy-to-read sample PDF file that it describes in full.

Other useful links:

+27
Aug 12 '14 at 15:31

When I first started working with PDF, I found that PDFTron's CosEdit lets you browse the object structure of a file, which helps in understanding it. There is a free demo version that lets you view a file but not save it.

+24
Sep 18 '08 at 13:26

Here's a link describing the structure of the PDF file. If you use Vim, the pdftk plugin is a good way to examine a document in progressively less raw form, and the pdftk utility itself (and its GPL source) is a great way to take documents apart.

+10
Sep 17 '08 at 23:18

I am trying to do almost the same thing. The PDF Reference is a very difficult document to read. This tutorial is the best starting point I can think of.

+7
Jul 09 '09 at 7:13

This may help shed some light: (from page 11 of PDF32000.book)

The PDF syntax is best understood by looking at it in four parts, as shown in Figure 1:

• Objects. A PDF document is a data structure consisting of a small set of basic types of data objects. Subclause 7.2, “Lexical Conventions,” describes the character set used to write objects and other syntax elements. Subclause 7.3, “Objects,” describes the syntax and essential properties of objects. Subclause 7.3.8, “Stream Objects,” contains detailed information about the most complex data type, the stream object.

• File structure. The structure of the PDF file determines how objects are stored in the PDF file, how they are accessed, and how they are updated. This structure is independent of the semantics of the objects. Subclause 7.5, “File Structure,” describes the file structure. Subclause 7.6, “Encryption,” describes a file-level mechanism for protecting the contents of documents from unauthorized access.

• Document structure. The structure of a PDF document determines how the basic types of objects are used to represent the components of a PDF document: pages, fonts, annotations, and so on. Subclause 7.7, “Document Structure,” describes the general structure of the document; later clauses cover the detailed semantics of the components.

• Content streams. A PDF content stream contains a series of instructions describing the appearance of a page or other graphical entity. These instructions, while also represented as objects, are conceptually distinct from the objects that represent the structure of the document and are described separately. Subclause 7.8, “Content Streams and Resources,” discusses PDF content streams and their related resources.
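
To make the "file structure" part a little more concrete, here is a rough Python sketch that looks for the landmarks of a classic PDF file in its raw bytes: the `%PDF-` header, the `N G obj` object markers, the `trailer` keyword, and the `startxref` offset before `%%EOF`. The function name `describe_pdf_layout` and the sample bytes are made up for illustration. It is a naive regex scan, not a parser: it can be fooled by binary stream data, and it will not see objects packed into compressed object streams.

```python
import re


def describe_pdf_layout(data: bytes) -> dict:
    """Report file-structure landmarks found in raw PDF bytes (naive scan)."""
    # The header names the PDF version, e.g. "%PDF-1.4".
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    # Indirect objects start with "<number> <generation> obj".
    objects = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)
    # "startxref" is followed by the byte offset of the cross-reference data;
    # with incremental updates there can be several, and the last one wins.
    startxref = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "object_count": len(objects),
        "has_trailer": b"trailer" in data,
        "xref_offset": int(startxref[-1]) if startxref else None,
    }


# Hypothetical, heavily shortened sample (not a complete, valid PDF).
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n9\n%%EOF\n"
)
print(describe_pdf_layout(sample))
```

For a real file you would pass in `open(path, "rb").read()` instead of the sample bytes.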

It looks like finding your way around a PDF file will take more than a casual effort.

+6
Jul 30 2018-11-11T00:

If you want to parse PDF using Python, check out PDFMiner. It is the best library for analyzing PDF files to date.

+3
Sep 17 '13 at 11:54
+3
Mar 02 '14 at 3:44

Extracting text from a PDF is a difficult problem because PDF has such a layout-oriented structure. You can look at the documentation and source code of my barely successful attempt on CPAN (a Perl implementation). The PDF data structure is very cool and well thought out, but it's easier to write than to read.

+2
Sep 19 '08 at 2:51

One way to get some hints is to create a PDF file consisting of a single blank page. I have CutePDF Writer on my computer, so I made an empty one-page WordPad document, printed it to a .pdf file, and then opened that .pdf file in Notepad.

Then work on a copy of this file: remove lines or blocks of text that look interesting and reopen the file in Acrobat Reader. You will be surprised at how little information is required to create a one-page PDF document.
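
As a rough illustration of how little is needed, the following Python sketch assembles a blank one-page PDF by hand: three objects (catalog, page tree, page), a cross-reference table, and a trailer. It is not what CutePDF writes. The function name `build_blank_pdf`, the PDF 1.4 header, and the default US Letter page size are assumptions chosen for the example.

```python
def build_blank_pdf(width: int = 612, height: int = 792) -> bytes:
    """Assemble a one-page blank PDF by hand, including the xref table."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> "
        b"/MediaBox [0 0 %d %d] >>" % (width, height),
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)  # the trailer's startxref points here
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 heads the free list
    for offset in offsets:
        # Each entry is exactly 20 bytes: offset, generation, "n", end-of-line.
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    # Writes a file you can open in a viewer or inspect in a text editor.
    with open("blank.pdf", "wb") as handle:
        handle.write(build_blank_pdf())
```

Because the cross-reference table stores byte offsets, editing such a file by hand in a text editor shifts the offsets; many viewers will try to repair the table, but not all of them.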

I am trying to make a table to create a PDF form from code.

+2
Aug 24 '10 at 16:52

You need the PDF Reference to start reading about the details and structure of PDF files. I suggest starting with version 1.7.

On Windows, I used the free PDF Analyzer tool to see the internal structure of PDF files. It helps your understanding while you read the reference manual.


(Disclosure: I am affiliated with PDF Analyzer; this is not meant as promotion.)

0
Dec 17 '18 at 8:06

To extract text from a PDF, try this on a computer running Linux, BSD, etc., or use Cygwin if you are on Windows:

pdftotext -layout some_pdf_file.pdf

This creates a plain text file named some_pdf_file.txt. The simpler the layout of the PDF file, the cleaner the .txt output will be.

Escape codes (a backslash followed by three digits, i.e. octal codes) are often present in the .txt output and will look strange in text editors. These codes usually represent curly single and double quotes, bullets, dashes, etc. in the PDF.

To see the context in which these codes appear, run this grep command and keep the original PDF file handy to see which characters the codes stand for:

 grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt 

This will provide a unique list of the various octal codes in the document:

 grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt|sort|uniq 

To convert these octal codes to their character equivalents, you can use a combination of grep, sed and bc. I will publish this procedure soon.
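
Until that procedure is posted, here is a minimal Python sketch of the same idea; it is not the grep/sed/bc procedure mentioned above. It lists the distinct `\ddd` codes in a string and replaces each with a character. The function names are made up, the pattern is narrowed to valid single-byte octal values (`\000` to `\377`), and decoding the bytes as Windows-1252 is an assumption: the correct mapping depends on the fonts and encodings used in the particular PDF.

```python
import re

# A backslash followed by three octal digits, limited to one byte (000-377).
OCTAL_ESCAPE = re.compile(r"\\([0-3][0-7]{2})")


def list_octal_escapes(text):
    """Return the distinct octal escapes in text, sorted (like sort | uniq)."""
    return sorted({match.group(0) for match in OCTAL_ESCAPE.finditer(text)})


def decode_octal_escapes(text, encoding="cp1252"):
    """Replace each octal escape with the character it encodes.

    The default encoding is an assumption; bytes that are undefined in the
    chosen encoding become the Unicode replacement character.
    """
    def replace(match):
        byte = bytes([int(match.group(1), 8)])
        return byte.decode(encoding, errors="replace")

    return OCTAL_ESCAPE.sub(replace, text)


if __name__ == "__main__":
    with open("some_pdf_file.txt", encoding="latin-1") as handle:
        content = handle.read()
    print(list_octal_escapes(content))
    print(decode_octal_escapes(content))
```

If the decoded characters look wrong, try another single-byte encoding such as `latin-1` or `mac_roman` and compare the result against the original PDF.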

0
Jul 26 '19 at 12:28


