If you rely on regular expressions alone, it is easy to construct a PDF file that your program cannot process. PDF dictionaries and lists can contain other objects, including more dictionaries and lists. Regular expressions cannot handle recursive structures like that, at least not with Python's re module.
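To illustrate the nesting problem, here is a minimal sketch. The names FLAT_DICT and find_dict_end are made up for this example, and the helper ignores strings, hex strings and comments, so treat it as a demonstration rather than a parser.

```python
import re

# Naive pattern for a dictionary: "<<", as little as possible, then ">>".
FLAT_DICT = re.compile(rb"<<(.*?)>>", re.DOTALL)

data = b"<< /Outer << /Inner 1 >> /After 2 >>"

# The non-greedy match stops at the first ">>", so the outer
# dictionary is cut short and /After is lost.
naive_body = FLAT_DICT.search(data).group(1)


def find_dict_end(buf: bytes, start: int) -> int:
    """Return the index just past the ">>" that closes the "<<" at start.

    Assumption: no strings, hex strings or comments contain "<<" or ">>".
    """
    depth = 0
    i = start
    while i < len(buf) - 1:
        pair = buf[i:i + 2]
        if pair == b"<<":
            depth += 1
            i += 2
        elif pair == b">>":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unbalanced dictionary delimiters")
```

Keeping a depth counter is the part a single re pattern cannot express; a real parser would do the same thing through recursion.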
A PDF file is a tree of objects and streams:
- Dictionaries: << (name value)* >>
- Lists: [ (value)* ]
- Names: / (plain char)*
- Strings: ( (char)* )
- Hex strings: < (hexchar)* >
- Numbers: (-)? ( (digit)+ | (digit)+ . (digit)* | . (digit)+ )
- Booleans: true | false
- References: (digit)+ (space)+ (digit)+ (space)+ R
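The grammar above maps naturally onto a recursive descent parser. The following is a rough sketch, not a complete implementation: parse_value, skip_ws and Ref are names invented for this example, names are returned as str with the leading slash, strings are returned as raw bytes without decoding escapes, and keyword boundaries are not checked.

```python
import re
from typing import NamedTuple

WS = re.compile(rb"(?:[\x00\t\n\x0c\r ]+|%[^\r\n]*)*")  # whitespace and comments
NUMBER = re.compile(rb"[+-]?(?:\d+\.\d*|\.\d+|\d+)")
NAME = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]*")
HEX = re.compile(rb"<([0-9A-Fa-f\s]*)>")
REF = re.compile(rb"(\d+)\s+(\d+)\s+R(?![A-Za-z0-9])")
KEYWORDS = {b"true": True, b"false": False, b"null": None}


class Ref(NamedTuple):
    """Indirect reference such as "1 0 R" (hypothetical helper type)."""
    num: int
    gen: int


def skip_ws(buf: bytes, pos: int) -> int:
    return WS.match(buf, pos).end()


def parse_value(buf: bytes, pos: int = 0):
    """Parse one object starting at buf[pos:]; return (value, end position)."""
    pos = skip_ws(buf, pos)
    if buf.startswith(b"<<", pos):               # dictionary: recurse on key and value
        pos, result = pos + 2, {}
        while True:
            pos = skip_ws(buf, pos)
            if buf.startswith(b">>", pos):
                return result, pos + 2
            key, pos = parse_value(buf, pos)
            value, pos = parse_value(buf, pos)
            result[key] = value
    if buf.startswith(b"[", pos):                # list: recurse on each item
        pos, items = pos + 1, []
        while True:
            pos = skip_ws(buf, pos)
            if buf.startswith(b"]", pos):
                return items, pos + 1
            item, pos = parse_value(buf, pos)
            items.append(item)
    if buf.startswith(b"(", pos):                # string: balanced parentheses
        depth, i = 0, pos
        while i < len(buf):
            c = buf[i:i + 1]
            if c == b"\\":
                i += 1                           # skip the escaped character
            elif c == b"(":
                depth += 1
            elif c == b")":
                depth -= 1
                if depth == 0:
                    return buf[pos + 1:i], i + 1
            i += 1
        raise ValueError("unterminated string")
    m = HEX.match(buf, pos)
    if m:
        digits = re.sub(rb"\s", b"", m.group(1)).decode("ascii")
        if len(digits) % 2:
            digits += "0"                        # an odd final digit is padded with 0
        return bytes.fromhex(digits), m.end()
    m = NAME.match(buf, pos)
    if m:
        return m.group().decode("latin-1"), m.end()
    m = REF.match(buf, pos)                      # must be tried before plain numbers
    if m:
        return Ref(int(m.group(1)), int(m.group(2))), m.end()
    m = NUMBER.match(buf, pos)
    if m:
        text = m.group().decode("ascii")
        return (float(text) if "." in text else int(text)), m.end()
    for word, value in KEYWORDS.items():
        if buf.startswith(word, pos):
            return value, pos + len(word)
    raise ValueError(f"unexpected data at offset {pos}")
```

For example, parse_value(b"<< /Kids [1 0 R 2 0 R] >>") returns a dict whose "/Kids" entry is a list of two Ref tuples, together with the offset just past the closing ">>".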
Whitespace and comments are ignored in most places. Comments begin with % and run to the end of the line.
Indirect objects are written as:
    1 0 obj (any object) endobj
This object can then be referred to as 1 0 R. An indirect dictionary may also be followed by a stream:
    1 0 obj
    << /Length 22 >>
    stream
    (22 bytes of raw data)
    endstream
    endobj
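The /Length entry matters because the raw stream data may itself contain the bytes "endstream". Here is a minimal sketch of reading a stream by length; read_stream is a hypothetical helper, and it assumes /Length is a direct integer (not a reference) and that the word "stream" does not occur inside the dictionary.

```python
import re

OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# A direct integer /Length; the lookahead rejects "/Length 5 0 R".
LENGTH = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")


def read_stream(buf: bytes, pos: int):
    """Return (object number, generation, stream bytes) for the object at pos."""
    header = OBJ_HEADER.match(buf, pos)
    if not header:
        raise ValueError("no indirect object at this offset")
    keyword = buf.index(b"stream", header.end())
    length_match = LENGTH.search(buf[header.end():keyword])
    if not length_match:
        raise ValueError("no direct /Length entry found")
    length = int(length_match.group(1))

    # The "stream" keyword is followed by CRLF or LF before the data begins.
    data_start = keyword + len(b"stream")
    if buf.startswith(b"\r\n", data_start):
        data_start += 2
    elif buf.startswith(b"\n", data_start):
        data_start += 1

    data = buf[data_start:data_start + length]
    tail = buf[data_start + length:data_start + length + 32]
    if not tail.lstrip().startswith(b"endstream"):
        raise ValueError("/Length does not match the stream data")
    return int(header.group(1)), int(header.group(2)), data
```

Slicing by /Length instead of searching for "endstream" is what keeps binary payloads intact.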
A PDF file looks something like this:
    %PDF-1.4
    %ÿÿÿÿ
    1 0 obj
    << /Author (MizardX) >>
    endobj
    2 0 obj
    << /Type /Catalog
       % more required keys
    >>
    endobj
    % many more indirect objects, one after another
    xref
    0 3
    0000000000 65535 f
    0000000015 00000 n
    0000000054 00000 n
    trailer
    << /Info 1 0 R
       /Root 2 0 R
       % ... more required keys
    >>
    startxref
    225
    %%EOF
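A reader normally starts from the end of the file: the number after startxref is the byte offset of the xref table, which in turn lists the byte offset of each indirect object. The sketch below shows that lookup under some assumptions: read_xref is a hypothetical helper, it handles only a classic xref table (not the xref streams introduced in PDF 1.5), and it looks only at the last startxref in the file.

```python
import re

STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def read_xref(buf: bytes) -> dict:
    """Return {object number: byte offset} for the in-use xref entries."""
    matches = list(STARTXREF.finditer(buf))
    if not matches:
        raise ValueError("no startxref found")
    pos = int(matches[-1].group(1))
    if not buf.startswith(b"xref", pos):
        raise ValueError("startxref does not point at an xref table")

    tokens = buf[pos + len(b"xref"):].split()
    offsets = {}
    i = 0
    # Each subsection starts with "first count", followed by count entries
    # of the form "offset generation n|f". Stop at the first non-numeric
    # token, which is the keyword that follows the table.
    while i + 1 < len(tokens) and tokens[i].isdigit() and tokens[i + 1].isdigit():
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for number in range(first, first + count):
            offset, _generation, kind = tokens[i:i + 3]
            i += 3
            if kind == b"n":          # "n" = in use, "f" = free
                offsets[number] = int(offset)
    return offsets
```

With the offsets in hand, each object can be parsed directly at its position instead of scanning the file from the top.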
The root of the object tree is the trailer dictionary. Every object that is part of the document is referenced, directly or indirectly, from this dictionary.
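Once the objects are parsed, following references from the trailer is a plain graph walk. This sketch assumes the objects are already held in a dict keyed by reference; Ref, reachable and the sample data are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """Indirect reference such as "1 0 R" (hypothetical helper type)."""
    num: int
    gen: int


def reachable(trailer: dict, objects: dict) -> set:
    """Return the set of Refs reachable from the trailer dictionary."""
    seen = set()
    stack = [trailer]
    while stack:
        value = stack.pop()
        if isinstance(value, Ref):
            # Skip references already visited (cycles such as /Parent
            # links are common) and references to missing objects.
            if value in seen or value not in objects:
                continue
            seen.add(value)
            stack.append(objects[value])
        elif isinstance(value, dict):
            stack.extend(value.values())
        elif isinstance(value, list):
            stack.extend(value)
    return seen


objects = {
    Ref(1, 0): {"/Author": b"MizardX"},
    Ref(2, 0): {"/Type": "/Catalog", "/Pages": Ref(3, 0)},
    Ref(3, 0): {"/Type": "/Pages", "/Kids": [Ref(4, 0)]},
    Ref(4, 0): {"/Type": "/Page", "/Parent": Ref(3, 0)},
    Ref(5, 0): {"/Unused": True},   # nothing refers to this object
}
trailer = {"/Info": Ref(1, 0), "/Root": Ref(2, 0)}
in_use = reachable(trailer, objects)
```

In this sample, object 5 is never reached from the trailer, so a reader can ignore it.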
The content inside the streams is much more complex, but it does not affect the file structure.
The full specification can be found on the Adobe website.