Parsing a PDF file using regular expressions in Python

I am trying to parse some elements of an object from a PDF file using the Python re module. My goal is to extract every PDF object with a regex. An example of the PDF objects is as follows:

1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
2 0 obj << /Type /Pages /Kids [ 3 0 R ] /Count 1 >> endobj
...

When I use "\d+\s\d+\sobj[\s,\S]*endobj", it does not work: the match runs all the way to the last endobj. How should I change the regular expression so that each object (the part from 1 0 obj to the nearest endobj) is matched separately?

+2
4 answers

If you rely on regular expressions alone, it is easy to construct a PDF file that your program cannot process. PDF dictionaries and lists may contain other dictionaries and lists, nested to any depth, and regular expressions cannot handle recursive structures, at least not with the Python re module.
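
As a small illustration of the nesting problem, the sketch below uses a made-up pattern for "one dictionary" (dict_pattern and the sample strings are invented for this example, not taken from the question). The lazy variant stops too early on nested dictionaries, and the greedy variant runs too far when two dictionaries sit side by side:

```python
import re

# Hypothetical patterns for "one dictionary": everything from << to >>.
dict_pattern = re.compile(r"<<[\s\S]*?>>")      # lazy
greedy_pattern = re.compile(r"<<[\s\S]*>>")     # greedy

# Three levels of nesting: the lazy match ends at the first >> it sees.
nested = "<< /Type /Page /Resources << /Font << /F1 5 0 R >> >> /Parent 2 0 R >>"
lazy_match = dict_pattern.search(nested).group(0)

# Two separate dictionaries: the greedy match swallows both of them.
siblings = "<< /A 1 >> << /B 2 >>"
greedy_match = greedy_pattern.search(siblings).group(0)

print(lazy_match)    # unbalanced: three << but only one >>
print(greedy_match)  # spans both dictionaries
```

Neither quantifier can count the open << markers, which is why a real parser is needed for anything beyond flat objects.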

A PDF file is a tree of objects and streams:

  • Dictionaries: << (name value) * >>
  • Lists: [ (value) * ]
  • Names: / (plain char) *
  • Strings: ( (char) * )
  • Hex strings: < (hexchar) * >
  • Numbers: ( - )? ( (digit) + | (digit) + . (digit) * | . (digit) + )
  • Booleans: true | false
  • References: (digit) + (space) + (digit) + (space) + R
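
The grammar above maps fairly directly onto a small recursive-descent parser. The following is only a sketch under simplifying assumptions: the input is a str, literal strings contain no nested or escaped parentheses, and streams are not handled. The names TOKEN, tokenize and parse are invented for this example.

```python
import re

# Simplified token pattern (see the assumptions stated above).
TOKEN = re.compile(r"""
    (?P<skip>\s+|%[^\r\n]*)               # whitespace and comments
  | (?P<ref>\d+\s+\d+\s+R\b)              # indirect reference, e.g. 2 0 R
  | (?P<open_dict><<) | (?P<close_dict>>>)
  | (?P<open_list>\[) | (?P<close_list>\])
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<string>\([^()]*\))
  | (?P<hex><[0-9A-Fa-f\s]*>)
  | (?P<number>[+-]?(?:\d+\.\d*|\.\d+|\d+))
  | (?P<word>true|false)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, text) pairs, dropping whitespace and comments."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.lastgroup != "skip":
            yield m.lastgroup, m.group()


def parse(text):
    """Parse one PDF value; dictionaries and lists recurse."""
    tokens = list(tokenize(text))
    pos = 0

    def value():
        nonlocal pos
        kind, tok = tokens[pos]
        pos += 1
        if kind == "open_dict":
            result = {}
            while tokens[pos][0] != "close_dict":
                key = value()            # keys are names
                result[key] = value()    # values may nest arbitrarily
            pos += 1                     # consume >>
            return result
        if kind == "open_list":
            items = []
            while tokens[pos][0] != "close_list":
                items.append(value())
            pos += 1                     # consume ]
            return items
        if kind == "ref":
            num, gen, _ = tok.split()
            return ("ref", int(num), int(gen))
        if kind == "number":
            return float(tok) if "." in tok else int(tok)
        if kind == "word":
            return tok == "true"
        if kind in ("name", "string", "hex"):
            return tok                   # kept as raw text in this sketch
        raise ValueError(f"unexpected token {tok!r}")

    return value()


parsed = parse("<< /Type /Pages /Kids [ 3 0 R ] /Count 1 /Inner << /Flag true /X -1.5 >> >>")
print(parsed)
```

The regex is used only to recognise individual tokens; the nesting is handled by ordinary function recursion, which is the part a single regular expression cannot do.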

Whitespace and comments are ignored in most places. Comments begin with % and run to the end of the line.

Indirect objects are written as:

 1 0 obj (any object) endobj 

This object can then be referenced as 1 0 R. Indirect dictionaries may also have a stream attached:

 1 0 obj << /Length 22 >> stream (22 bytes of raw data) endstream endobj 
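
Because the stream data is arbitrary bytes, it may itself contain the text endobj or endstream, so it is safer to locate the payload through /Length than by searching for a keyword. A minimal sketch, assuming /Length is a direct integer rather than an indirect reference (read_stream is a hypothetical helper, not part of any library):

```python
import re

LENGTH = re.compile(rb"/Length\s+(\d+)")
STREAM_KEYWORD = re.compile(rb"stream\r?\n")


def read_stream(data: bytes, obj_start: int = 0) -> bytes:
    """Return the raw stream bytes of the object beginning at obj_start.

    Assumes the dictionary holds a direct integer /Length.
    """
    length_match = LENGTH.search(data, obj_start)
    if length_match is None:
        raise ValueError("no /Length entry found")
    keyword = STREAM_KEYWORD.search(data, length_match.end())
    if keyword is None:
        raise ValueError("no stream keyword found")
    start = keyword.end()
    return data[start:start + int(length_match.group(1))]


# The payload deliberately contains "endobj" to show why counting bytes matters.
payload = b"fake endobj inside data"
data = (
    b"1 0 obj\n<< /Length %d >>\nstream\n" % len(payload)
    + payload
    + b"\nendstream\nendobj\n"
)
stream_bytes = read_stream(data)
```

A pattern that stops at the first endobj would cut this object off in the middle of its stream.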

A PDF file looks something like this:

 %PDF-1.4
 %ÿÿÿÿ
 1 0 obj
 << /Author (MizardX) >>
 endobj
 2 0 obj
 << /Type /Catalog
    % more required keys
 >>
 endobj
 %lots of more indirect objects, one after another
 trailer
 << /Info 1 0 R
    /Root 2 0 R
    % ... more required keys
 >>
 xref
 0 3
 0000000000 65535 f
 0000000015 00000 n
 0000000054 00000 n
 startxref
 225
 %%EOF

The root of the object tree is the trailer dictionary. Every object in use is referenced, directly or indirectly, from this dictionary.

What goes on inside the streams is much more complicated, but it does not affect the file structure.

The full specification can be found on the Adobe website.

+6

Not quite an answer to your exact question, but you might want to look at existing PDF parsing libraries for Python, for example pdfminer or pyPdf. Even if you do not use them, you can read their source to see how they do it.

+2

You need to use *?, the non-greedy version of *.

Also note that the PDF format is very complex, especially when it starts to have binary streams inside it, but if you know that the PDF files you are looking at are simple, then this should work.

+1

A question mark after the quantifier makes it match as few characters as possible. The comma inside the character class is also unnecessary, since \S already matches it.

 \d+\s\d+\sobj[\s\S]*?endobj 
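
A quick check of both patterns against the sample from the question (the variable names are only for illustration, and a real file would be read as bytes rather than str):

```python
import re

# Sample objects copied from the question.
data = (
    "1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj "
    "2 0 obj << /Type /Pages /Kids [ 3 0 R ] /Count 1 >> endobj"
)

greedy = re.compile(r"\d+\s\d+\sobj[\s\S]*endobj")   # runs to the last endobj
lazy = re.compile(r"\d+\s\d+\sobj[\s\S]*?endobj")    # stops at the nearest endobj

greedy_matches = greedy.findall(data)
lazy_matches = lazy.findall(data)

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 2
```

This works for simple objects like these, but it still stops at the first endobj it meets, even if that text is inside a string or a stream.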
+1

Source: https://habr.com/ru/post/899998/

