PDFminer blank output

Question


When processing a file using pdfminer (pdf2txt.py), I got empty output:

dan@work:~/project$ pdf2txt.py  docs/homericaeast.pdf 

dan@work:~/project$

Can someone tell me what is wrong with this file, and what I can do to get data from it?

Here's the output of dumppdf.py docs/homericaeast.pdf:

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>

+6

python pdf pdf-parsing pdfminer

Daniel M May 07 '17 at 14:10


2 answers

The problem is that pdfminer does not understand the CMap used in this PDF file.

You can see this by setting STRICT=1 in pdfminer's psparser.py, which produces the following error:

pdfminer.psparser.PSTypeError: Literal required: <PDFStream(21): raw=267, {u'Filter': /'FlateDecode', u'CMapName': /u'OneByteIdentityH', u'Type': /u'CMap', u'CIDSystemInfo': <PDFObjRef:20>, u'Length': 266}>

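As the error shows, pdfminer expected a literal name for the encoding but found an embedded CMap stream instead (named OneByteIdentityH, judging by its CMapName). As a result, the CMap is not loaded, and the text in the PDF is not decoded (everything maps to None, hence the empty output).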
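The "Literal required" error above arises because the encoding entry is a stream carrying a CMapName rather than a plain name. A minimal sketch of resolving the name in both cases is shown here. The classes `Name` and `Stream` and the function `cmap_name` are hypothetical stand-ins invented for illustration; they are not pdfminer's actual objects or API.

```python
from dataclasses import dataclass, field


@dataclass
class Name:
    """Hypothetical stand-in for a PDF name object such as /Identity-H."""
    name: str


@dataclass
class Stream:
    """Hypothetical stand-in for a PDF stream with an attribute dictionary."""
    attrs: dict = field(default_factory=dict)


def cmap_name(encoding):
    """Return the CMap name whether the encoding is a name or an embedded stream."""
    if isinstance(encoding, Name):
        # Simple case: the encoding is given directly as a name.
        return encoding.name
    if isinstance(encoding, Stream):
        # Embedded CMap: the name lives in the stream's CMapName entry.
        inner = encoding.attrs.get("CMapName")
        if isinstance(inner, Name):
            return inner.name
    # Neither form could be resolved.
    return "unknown"


print(cmap_name(Name("Identity-H")))
print(cmap_name(Stream({"CMapName": Name("OneByteIdentityH")})))
```

A strict parser that only accepts the first form would reject the second, which matches the behaviour seen in the error message.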

To fix this, pdfminer would need to handle this CMap, for example by treating it as an Identity mapping, in cmapdb.py.

+2

Peter Brittain 12 '17 17:21

hynekcer · Accepted Answer · 2017-05-13T23:06:06+0000

I have now fixed the problem with /OneByteIdentityH, using code similar to that for the double-byte Unicode mapping /Identity-H. The patch is in PR #179.
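To illustrate the difference between the two mappings, here is a rough sketch of identity decoding for one-byte and two-byte character codes. The function names are hypothetical and this is not the code from the patch; it only assumes that an identity CMap maps each character code to the CID with the same numeric value.

```python
import struct


def decode_identity_two_byte(code: bytes) -> tuple:
    """Identity-H style: every big-endian byte pair is one character ID."""
    n = len(code) // 2
    # A trailing odd byte, if any, is ignored in this sketch.
    return struct.unpack(">%dH" % n, code[: 2 * n])


def decode_identity_one_byte(code: bytes) -> tuple:
    """One-byte identity style: every single byte is one character ID."""
    return tuple(code)


# The same two character IDs, encoded in each scheme.
print(decode_identity_two_byte(b"\x00H\x00i"))  # (72, 105)
print(decode_identity_one_byte(b"Hi"))          # (72, 105)
```

Decoding one-byte data with the two-byte routine would pair up unrelated bytes, which is why the two cases need separate handling.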
