Extract all PDF data with Python pdfminer

I am using pdfminer to extract data from PDF files with Python. I would like to extract all the data in a PDF, regardless of whether it is an image, text, or something else. Can we do this in one line (or two, if necessary), without much work? Any help is appreciated. Thanks in advance.

+4
3 answers

Can we do this in one line (or two, if necessary), without much work?

No, you can’t. Pdfminer is powerful, but rather low level.

Unfortunately, the documentation is not exhaustive. I managed to find my way around it thanks to some code by Denis Papathanasiou. The code is discussed on his blog, and you can find the source here: layout_scanner.py

See also this answer, where I go into a little more detail.
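
To make "low level" concrete, here is a minimal sketch of walking a document's layout tree and counting the kinds of objects it contains (text boxes, images, figures, and so on). It assumes the pdfminer.six fork and its pdfminer.high_level.extract_pages helper, which is not what layout_scanner.py uses; walk_layout and summarize_pdf are hypothetical names chosen for illustration.

```python
from collections import Counter


def walk_layout(element):
    """Yield `element` and, recursively, everything nested inside it."""
    yield element
    # Layout containers (pages, figures, text boxes, text lines) are
    # iterable; leaf objects such as characters or images are not.
    try:
        children = iter(element)
    except TypeError:
        return
    for child in children:
        yield from walk_layout(child)


def summarize_pdf(path):
    """Count layout objects in a PDF by class name (hypothetical helper)."""
    # Imported lazily so walk_layout can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages

    counts = Counter()
    for page_layout in extract_pages(path):
        for element in walk_layout(page_layout):
            counts[type(element).__name__] += 1
    return counts
```

Calling summarize_pdf on a file returns a Counter keyed by class names such as LTTextBoxHorizontal or LTImage. Turning each image object into usable bytes, or each text box into structured data, takes further per-type handling, which is why there is no one-liner for "everything".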

+7

For Python 3:

pip install pdfminer.six

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path, codec='utf-8'):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0     # 0 means no page limit
        caching = True
        pagenos = set()  # an empty set means all pages

        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

        fp.close()
        device.close()
        retstr.close()
        return text

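
If only the text is needed, recent pdfminer.six releases also ship a shortcut, pdfminer.high_level.extract_text, that wraps roughly the same steps as the function above. A minimal sketch, assuming a pdfminer.six version that includes it; pdf_to_text and split_pages are hypothetical helper names, and splitting on form feeds relies on the assumption that the text converter writes one after each page.

```python
def pdf_to_text(path, page_numbers=None):
    """Return the text of a PDF; `page_numbers` optionally limits the pages."""
    # Imported lazily so split_pages can be used without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(path, page_numbers=page_numbers)


def split_pages(text):
    """Split extracted text into per-page strings on form feed characters."""
    pages = text.split("\f")
    # Assumption: a form feed follows every page, leaving an empty tail.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages
```

This covers the "one line" part of the question for text only. Images and other embedded objects are not included in the returned string.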
+2

For Python 3, there is another option: pip install pdfminer3k

    from functools import wraps
    from io import StringIO
    import time

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf


    def fn_timer(function):
        # Decorator that reports how long the wrapped function took to run.
        @wraps(function)
        def function_timer(*args, **kwargs):
            t0 = time.time()
            result = function(*args, **kwargs)
            t1 = time.time()
            print("Total time running %s: %s seconds" % (function.__name__, str(t1 - t0)))
            return result
        return function_timer


    @fn_timer
    def convert_pdf(path, pages):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, laparams=laparams)
        fp = open(path, 'rb')
        # Only the pages listed in `pages` are processed.
        process_pdf(rsrcmgr, device, fp, pages)
        fp.close()
        device.close()
        text = retstr.getvalue()
        retstr.close()
        return text


    pdf_path = r'M:\a.pdf'
    print(convert_pdf(pdf_path, [1]))

0

Source: https://habr.com/ru/post/1485215/

