Working with tables in PDF using Python

I am working with a PDF file that contains several tables. I want to use Python to extract the data from a table, identified by the table name given in the PDF.

I have worked with HTML and XML parsing, but never with PDF.
Can someone tell me how to extract tables from a PDF using Python?

+4
4 answers

I think you need a Python PDF parser library. The best known is PDFMiner.

According to the documentation:

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner lets you obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
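
As a rough illustration of the "exact location of text" part, here is a minimal sketch. It assumes the pdfminer.six fork (the maintained Python 3 version of PDFMiner) is installed; `iter_text_boxes` and `find_boxes` are hypothetical helper names, not part of the library, and the file path is whatever PDF you pass in.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, x0, y0, x1, y1, text) for each text box in the PDF.

    Assumes pdfminer.six is installed; the import is done lazily so the
    pure-Python helper below can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox  # PDF points, origin at bottom-left
                yield (page_number, x0, y0, x1, y1, element.get_text().strip())


def find_boxes(boxes, needle):
    """Return the boxes whose text contains `needle`, e.g. a table name."""
    return [box for box in boxes if needle in box[-1]]
```

Something like `find_boxes(iter_text_boxes("your_file.pdf"), "Name of your table")` would then give you the page and coordinates of the table's caption, which is the starting point for locating the rows under it.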

+5

I recently had a similar problem and wrote a library to help solve it: pdfquery.

PDFQuery creates an element tree from the PDF (using pdfminer, with some extra sugar) and lets you fetch elements from a page using jQuery or XPath selectors, based mainly on text content or element location. So, to parse a table, you would first find where it is in the document by searching for its label:

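    import pdfquery

    pdf = pdfquery.PDFQuery("your_file.pdf")  # placeholder path
    pdf.load()
    label = pdf.pq(':contains("Name of your table")')
    left_corner = float(label.attr('x0'))    # left edge of the label
    bottom_corner = float(label.attr('y0'))  # bottom edge of the label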

You would then keep searching for rows below the label until the search stops returning results:

    page = label.closest('LTPage')
    while True:
        # PDF y-coordinates grow upward, so rows below the label have smaller y values.
        row = pdf.extract([
            ('column_1', ':in_bbox("%s,%s,%s,%s")' % (left_corner + 10, bottom_corner - 40, left_corner + 50, bottom_corner - 20)),
            ('column_2', ':in_bbox("%s,%s,%s,%s")' % (left_corner + 50, bottom_corner - 40, left_corner + 80, bottom_corner - 20)),
        ], page)
        # Stop once neither column returns anything.
        if not (row['column_1'] or row['column_2']):
            break
        print("Got row:", row)
        bottom_corner -= 20  # move down one row

This assumes your rows are 20 points high, the first row starts 20 points below the label, the first column spans 10 to 50 points from the left edge of the label, and the second column spans 50 to 80 points from the left edge of the label.
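
To make the arithmetic concrete, those assumptions can be written out as a small helper. This is only a sketch: `row_bboxes` and `in_bbox_selector` are hypothetical names, not part of pdfquery, and it reads "below" as smaller y values, since PDF coordinates have their origin at the bottom-left of the page.

```python
def row_bboxes(left_corner, bottom_corner, row_index, row_height=20):
    """Return (x0, y0, x1, y1) boxes for both columns of the given row.

    Row 0 starts `row_height` points below the label's bottom edge;
    each following row sits one `row_height` further down the page.
    """
    top = bottom_corner - row_height * (row_index + 1)
    bottom = top - row_height
    return {
        "column_1": (left_corner + 10, bottom, left_corner + 50, top),
        "column_2": (left_corner + 50, bottom, left_corner + 80, top),
    }


def in_bbox_selector(bbox):
    """Format a bounding box as an :in_bbox("x0,y0,x1,y1") selector string."""
    return ':in_bbox("%s,%s,%s,%s")' % tuple(bbox)
```

For a label whose lower-left corner is at (100, 700), the first row's first column would be searched in the box from (110, 660) to (150, 680), and each later row shifts that box down by 20 points.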

If you have empty rows or rows of varying height, this gets more annoying. You may also need to use the merge_tags=None option to select individual characters rather than words, if the table entries are close enough together that the parser reads them as a single line. But I hope this gets you closer ...

+5

This is a very complex problem and is not solvable in general.

The reason is that the PDF format is too flexible. Some PDF files are just bitmap images (then you would have to do your own OCR, which is obviously not our topic here), and some are a bunch of letters literally scattered across the pages; that is, by parsing the textual information in the PDF you get individual characters placed at certain coordinates. In some cases they come in an orderly fashion (line by line, from left to right), in other cases you get fairly random distributions, and most often something in between, where special characters, characters in a different font, and so on may come out of order.

The only proper approach is to place all the characters on a page model according to their coordinates, and then use heuristics to work out what the lines are.
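
A minimal sketch of that idea, assuming you already have each character with its x and y position (from whatever PDF parser you use). The `Char` tuple, the function names, and the tolerance values are all illustrative assumptions; real documents will need tuning and probably smarter heuristics.

```python
from collections import namedtuple

# One character with the coordinates of its origin, in PDF points.
Char = namedtuple("Char", ["x", "y", "text"])


def group_into_lines(chars, y_tolerance=2.0):
    """Cluster characters into lines by y position, top of page first.

    A character joins the current line if its y is within `y_tolerance`
    of the line's first character; each line is then sorted left to right.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: -c.y):
        if lines and abs(lines[-1][0].y - ch.y) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return [sorted(line, key=lambda c: c.x) for line in lines]


def split_into_cells(line, max_gap=15.0):
    """Split one sorted, non-empty line into cells wherever the horizontal gap is large."""
    cells = [[line[0]]]
    for prev, ch in zip(line, line[1:]):
        if ch.x - prev.x > max_gap:
            cells.append([ch])
        else:
            cells[-1].append(ch)
    return ["".join(c.text for c in cell) for cell in cells]
```

With these, `[split_into_cells(line) for line in group_into_lines(chars)]` gives a list of rows, each a list of cell strings. The gap heuristic is the fragile part: it works only when the column spacing is clearly larger than the spacing between characters.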

I suggest taking a look at your PDF files and the tables you want to parse before you start. Perhaps they all share the same layout and are well behaved.

Good luck

+3

Note: this one is a Java tool, though.

This is useful for extracting data from tables inside a PDF.

PDF2Table main documentation

PDF2Table Windows JAR

PDF2Table for Mac or Linux

-1

Source: https://habr.com/ru/post/1402383/

