Parsing a PDF file using regular expressions in Python

Question

I am trying to parse some elements of an object from a PDF file using the Python re module. My goal is to parse every PDF object using a regex. Example PDF objects look like this:

1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj 2 0 obj << /Type /Pages /Kids [ 3 0 R ] /Count 1 >> endobj ...

When I use "\d+\s\d+\sobj[\s,\S]*endobj", it does not work: the match runs on to the last endobj. How should I change the regular expression so that each object (that is, the part from 1 0 obj to endobj) is matched separately?

python regex parsing pdf

Iketani Kouichiro Oct 12 '10 at 13:22

4 answers

Markus jarderot · Answer 1 · 2010-10-12T14:27:06+0000

If you rely on regular expressions alone, it is easy to construct a PDF file that your program cannot process. PDF dictionaries and lists may contain other objects, and regular expressions cannot handle recursive structures, at least not with the Python re module.

A PDF file is a tree of objects and streams:

Dictionaries: << (name value)* >>
Lists: [ (value)* ]
Names: / (plain char)*
Strings: ( (char)* )
Hex strings: < (hexchar)* >
Numbers: (-)? ( (digit)+ | (digit)+ . (digit)* | . (digit)+ )
Booleans: true | false
References: (digit)+ (space)+ (digit)+ (space)+ R

Whitespace and comments are ignored in most places. Comments begin with % and run to the end of the line.
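To make the recursion point concrete, here is a minimal sketch of a tokenizer plus a recursive parser for part of the grammar above. It is only an illustration, not how any particular library does it: the names TOKEN, tokenize and parse_value are made up for this example, and literal strings, hex strings and streams are deliberately left out to keep it short. A regex is still used, but only to split the input into tokens; the nesting is handled by ordinary recursion.

```python
import re

# One alternative per token type; order matters ("2 0 R" must win over "2").
TOKEN = re.compile(r"""
    (?P<ws>\s+|%[^\r\n]*)
  | (?P<ref>\d+\s+\d+\s+R\b)
  | (?P<num>[+-]?(?:\d+\.?\d*|\.\d+))
  | (?P<bool>true\b|false\b)
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<open><<|\[)
  | (?P<close>>>|\])
""", re.X)


def tokenize(text):
    """Yield (kind, text) pairs, skipping whitespace and comments."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group()


def parse_value(tokens, first=None):
    """Parse one value from the token iterator, recursing into containers."""
    kind, text = first if first is not None else next(tokens)
    if kind == "ref":
        num, gen, _ = text.split()
        return ("ref", int(num), int(gen))
    if kind == "num":
        return float(text) if "." in text else int(text)
    if kind == "bool":
        return text == "true"
    if kind == "name":
        return text
    if kind == "open":
        closer = ">>" if text == "<<" else "]"
        items = []
        for tok in tokens:
            if tok[0] == "close":
                if tok[1] != closer:
                    raise ValueError(f"expected {closer}, got {tok[1]}")
                break
            items.append(parse_value(tokens, tok))
        else:
            raise ValueError("unterminated container")
        if text == "[":
            return items
        # A dictionary is a flat run of name/value pairs.
        return dict(zip(items[0::2], items[1::2]))
    raise ValueError(f"unexpected token {text!r}")


sample = "<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>"
parsed = parse_value(tokenize(sample))
print(parsed)
```

Because parse_value calls itself for every << or [ it meets, dictionaries inside lists inside dictionaries need no special handling, which is exactly what a single regular expression cannot give you.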

Indirect objects are indicated as:

 1 0 obj (any object) endobj

This object can then be referenced as 1 0 R. An indirect dictionary may also be followed by a stream:

 1 0 obj << /Length 22 >> stream (22 bytes of raw data) endstream endobj
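Streams are where a pure regex approach tends to break, because the raw bytes can contain anything, including the text endobj. The sketch below shows a lazy pattern stopping too early, and then a reader that trusts /Length instead. It assumes that /Length is a direct integer (in real files it can be an indirect reference) and that the stream dictionary has no nested dictionaries; HEADER, TAIL and read_stream_object are names invented for this example.

```python
import re

# A stream object whose 22 bytes of data happen to contain "endobj".
raw = (b"1 0 obj << /Length 22 >> stream\n"
       b"binary endobj payload!"
       b"\nendstream endobj")

# The lazy regex stops at the first "endobj", which is inside the stream.
OBJ = re.compile(rb"\d+\s\d+\sobj[\s\S]*?endobj")
naive = OBJ.search(raw).group(0)

# Assumption: flat dictionary with a direct integer /Length.
HEADER = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\s*<<.*?/Length\s+(\d+).*?>>\s*stream\r?\n", re.S)
TAIL = re.compile(rb"\s*endstream\s+endobj")


def read_stream_object(buf, pos=0):
    """Return (number, generation, payload, end_offset) for a stream object."""
    m = HEADER.match(buf, pos)
    if m is None:
        raise ValueError("no stream object at this offset")
    length = int(m.group(3))
    start = m.end()
    payload = buf[start:start + length]   # skip the data by length, not by search
    tail = TAIL.match(buf, start + length)
    if tail is None:
        raise ValueError("endstream/endobj not found after the stream data")
    return int(m.group(1)), int(m.group(2)), payload, tail.end()


num, gen, payload, end = read_stream_object(raw)
print(naive)
print(payload)
```

The first print shows the truncated match, the second the complete stream data.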

A PDF file looks something like this:

 %PDF-1.4
 %ÿÿÿÿ
 1 0 obj
 << /Author (MizardX) >>
 endobj
 2 0 obj
 << /Type /Catalog
    % more required keys
 >>
 endobj
 %lots of more indirect objects, one after another
 trailer
 << /Info 1 0 R
    /Root 2 0 R
    % ... more required keys
 >>
 xref
 0 3
 0000000000 65535 f
 0000000015 00000 n
 0000000054 00000 n
 startxref
 225
 %%EOF

The root of the object tree is the trailer dictionary. Every object is referenced, directly or indirectly, from this dictionary.
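As a rough illustration of starting from the trailer, the sketch below pulls the references out of the trailer dictionary and then looks up the object that /Root points to. It assumes a small, text-only file with a flat trailer dictionary and no streams; trailer_refs and find_object are hypothetical helpers, and a real reader would locate objects through the xref table rather than by searching.

```python
import re

pdf_text = """%PDF-1.4
1 0 obj << /Author (MizardX) >> endobj
2 0 obj << /Type /Catalog >> endobj
trailer << /Info 1 0 R /Root 2 0 R >>
"""

# Assumption: the trailer dictionary contains no nested dictionaries.
TRAILER = re.compile(r"trailer\s*<<(.*?)>>", re.S)
REF = re.compile(r"/(\w+)\s+(\d+)\s+(\d+)\s+R\b")


def trailer_refs(text):
    """Map each trailer key to the (number, generation) it refers to."""
    m = TRAILER.search(text)
    if m is None:
        raise ValueError("no trailer found")
    return {name: (int(num), int(gen))
            for name, num, gen in REF.findall(m.group(1))}


def find_object(text, num, gen):
    """Return the body of 'num gen obj ... endobj' (no streams assumed)."""
    m = re.search(rf"\b{num}\s+{gen}\s+obj\b(.*?)endobj", text, re.S)
    if m is None:
        raise KeyError((num, gen))
    return m.group(1).strip()


refs = trailer_refs(pdf_text)
root_body = find_object(pdf_text, *refs["Root"])
print(refs)
print(root_body)
```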

The content inside streams is considerably more complicated, but that does not affect the file structure.

The full specification can be found on the Adobe website.

Steven · Answer 2 · 2010-10-12T13:53:20+0000

Not quite an answer to your exact question, but you might want to look at existing PDF parsing libraries for Python, for example pdfminer or pyPdf. Even if you do not use them, you can read their source to see how they do it.

neil · Answer 3 · 2010-10-12T13:36:07+0000

You need to use *?, the non-greedy version of *.

Also note that the PDF format is very complex, especially when it starts to have binary streams inside it, but if you know that the PDF files you are looking at are simple, then this should work.

Aidas bendoraitis · Answer 4 · 2010-10-12T13:44:22+0000

A question mark after the quantifier makes it match as few characters as possible. The comma in the character class is also unnecessary, since \S already covers it.

 \d+\s\d+\sobj[\s\S]*?endobj
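A quick way to see the difference is to run both patterns over the two objects from the question. This is only a small demonstration on that sample text (the names data, greedy and lazy are just for the example), not a general PDF parser.

```python
import re

data = ("1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj "
        "2 0 obj << /Type /Pages /Kids [ 3 0 R ] /Count 1 >> endobj")

# Greedy: [\s\S]* grabs everything and backtracks only to the LAST endobj.
greedy = re.findall(r"\d+\s\d+\sobj[\s\S]*endobj", data)

# Lazy: [\s\S]*? stops at the FIRST endobj after each "N G obj".
lazy = re.findall(r"\d+\s\d+\sobj[\s\S]*?endobj", data)

print(len(greedy))  # 1
print(len(lazy))    # 2
for obj in lazy:
    print(obj)
```

The greedy pattern returns one match spanning both objects, while the lazy one returns each object separately.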
