PDFminer blank output

When processing a file using pdfminer (pdf2txt.py), I got empty output:

dan@work:~/project$ pdf2txt.py  docs/homericaeast.pdf 

dan@work:~/project$ 

Can someone tell me what is wrong with this file, and what I can do to get data from it?

Here's the output of dumppdf.py docs/homericaeast.pdf:

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>

<trailer>
<dict size="4">
<key>Info</key>
<value><ref id="2" /></value>
<key>Root</key>
<value><ref id="1" /></value>
<key>ID</key>
<value><list size="2">
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
<string size="16">on&#10;&#164;&#181;F&#164;5&#193;&#62;&#243;_&#253;v&#172;`</string>
</list></value>
<key>Size</key>
<value><number>27</number></value>
</dict>
</trailer>
+6
2 answers

I have now fixed the problem by handling /OneByteIdentityH with the same code as the double-byte Unicode mapping /Identity-H. The patch is in PR #179.
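To illustrate what an identity mapping does (only a sketch, not the code from the PR): an identity CMap turns character codes directly into CIDs. /Identity-H reads two bytes per code, while a one-byte identity mapping would read one byte per code. The function names below are made up for this example.

```python
import struct


def decode_identity_2byte(data: bytes) -> tuple:
    """Treat every big-endian byte pair as one CID (as /Identity-H does)."""
    n = len(data) // 2
    # A trailing odd byte, if any, is ignored.
    return struct.unpack(">%dH" % n, data[: 2 * n])


def decode_identity_1byte(data: bytes) -> tuple:
    """Treat every single byte as one CID (hypothetical one-byte identity)."""
    return tuple(data)


codes = b"\x00\x48\x00\x69"
print(decode_identity_2byte(codes))  # (72, 105)
print(decode_identity_1byte(codes))  # (0, 72, 0, 105)
```

The same byte string yields different CIDs depending on the code width, so which mapping is right depends on how the font's character codes were written into the PDF.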

+4

The problem is that pdfminer does not understand the CMap used in this PDF file.

Running pdfminer with STRICT=1 set in psparser.py gives:

pdfminer.psparser.PSTypeError: Literal required: <PDFStream(21): raw=267, {u'Filter': /'FlateDecode', u'CMapName': /u'OneByteIdentityH', u'Type': /u'CMap', u'CIDSystemInfo': <PDFObjRef:20>, u'Length': 266}>

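So pdfminer expected a literal name here but got a CMap stream (one named OneByteIdentityH, which it does not know). With strict mode off, the CMap is simply not resolved for this PDF (the lookup comes back as None), so no text is produced.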

It should be possible to add support for this CMap, treating it like Identity, in cmapdb.py.
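One way to picture handling this CMap like Identity is the rough sketch below. All names are hypothetical and do not mirror pdfminer's cmapdb.py. It assumes the font's encoding arrives either as a plain name or as a stream dictionary carrying a CMapName entry, as in the error message above.

```python
# Hypothetical stand-in for a CMap lookup; not pdfminer's real API.
IDENTITY_NAMES = {"Identity-H", "Identity-V", "OneByteIdentityH"}


def cmap_name(encoding):
    """Return the CMap name from a plain name or from a stream-like dict."""
    if isinstance(encoding, dict):
        return encoding.get("CMapName")
    return encoding


def is_identity_cmap(encoding):
    """True when the encoding should be handled as an identity mapping."""
    return cmap_name(encoding) in IDENTITY_NAMES


stream_attrs = {"Type": "CMap", "CMapName": "OneByteIdentityH"}
print(is_identity_cmap("Identity-H"))
print(is_identity_cmap(stream_attrs))
print(is_identity_cmap("SomeOtherCMap"))
```

Reading the name out of the stream dictionary first means an embedded CMap and a named CMap go through the same check, instead of the stream being rejected for not being a literal.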

+2

Source: https://habr.com/ru/post/1017061/

