Analyze PDF document tables

The PDF in this link ( http://www.lenovo.com/psref/pdf/psref450.pdf ) contains several tables like this:


I would like to programmatically extract data and structure from these tables.

Things I've tried: converting PDF to HTML using

  • Tika. Unfortunately, the tables come out as paragraph-separated text rather than table markup, and some cells contain spaces, so the rows cannot be split reliably.
  • Python PDFMiner: it raised an assertion error because of missing fonts. I suspect the HTML would be similar to Tika's output, but I need to fix the missing-font problem to confirm this.
  • Online tools. I tried http://www.zamzar.com/ and a couple of others. Either the file was too large for the service to process, or the conversion produced errors.

I planned to convert PDF to HTML and then parse it using BeautifulSoup.

The output can be JSON (for example, one object per table), XML, or almost any other format that preserves the structure.

+4
3 answers

You can try PDFBox. The documentation for this is here:

https://pdfbox.apache.org/1.8/cookbook/textextraction.html

The relevant class is org.apache.pdfbox.pdfviewer.PDFPageDrawer, in particular its strokePath method.

+5

@alex-woolford: in my experience, extracting table data from PDF never works 100%.

Two tools worth looking at are xpdf and PDFTextStream. xpdf is written in C; PDFTextStream is a Java library.

Since xpdf is C and PDFTextStream is Java, either could be driven from Python through XML-RPC or a similar bridge.

+1

FYI, here is a sample of tab-delimited output:

2469-2TU    i5-3320M    4GBx1   14.0" HD    720p    500G 7200   Intel 620528    WWAN upg    Express 54  Finger  BT  6   Win7 Pro64  10/12
✂ 2469-2SU  i5-3210M    4GBx1   14.0" HD    720p    500G 7200   Intel 2200  WWAN upg    Express 54  None    None    6   Win7 Pro64  10/12
✂ 2469-2RU  i3-3110M    4GBx1   14.0" HD    720p    320G 7200   Intel 2200  WWAN upg    Express 54  None    None    6   Win7 Pro64  10/12
2469-32U    i5-3230M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13
2469-2ZU    i5-3230M    4GBx1   14.0" HD    720p    320G 7200   Intel 2200  WWAN upg    None    None    None    6   Win7 Pro64  02/13
2469-2YU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13
2469-2XU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   Intel 6205  WWAN upg    None    None    None    6   Win7 Pro64  02/13
2469-2WU    i5-3320M    4GBx1   14.0" HD    720p    320G 7200   WLAN upg    WWAN upg    None    Finger  BT  6   Win7 Pro64  02/13

I second PDFBox, since it works much like my own hand-written utility: read the (x, y) position of each piece of text, sort them, group them into "probable" lines, and insert a tab wherever the horizontal gap is larger than would reasonably be expected.
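The steps just described can be sketched in Python. This is only an illustration, assuming the text fragments have already been extracted as (x, y, width, text) tuples with y growing downward. The tuple layout, the name fragments_to_rows, and both threshold values are hypothetical and are not part of PDFBox or any other library:

```python
from typing import List, Tuple

# Hypothetical fragment format: (x, y, width, text). Not a library API.
Fragment = Tuple[float, float, float, str]


def fragments_to_rows(fragments: List[Fragment],
                      line_tolerance: float = 2.0,
                      gap_threshold: float = 8.0) -> List[str]:
    """Group positioned text fragments into tab-delimited rows."""
    # Sort top to bottom, then left to right (assumes y grows downward).
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))

    # Collect "probable" lines: a fragment joins the current line when its
    # y is within line_tolerance of that line's first fragment.
    lines: List[List[Fragment]] = []
    for frag in ordered:
        if lines and abs(frag[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    rows = []
    for line in lines:
        line.sort(key=lambda f: f[0])  # restore left-to-right order
        parts = [line[0][3]]
        for prev, cur in zip(line, line[1:]):
            # Horizontal space between the end of prev and the start of cur.
            gap = cur[0] - (prev[0] + prev[2])
            parts.append("\t" if gap > gap_threshold else " ")
            parts.append(cur[3])
        rows.append("".join(parts))
    return rows


sample = [
    (150.0, 100.5, 40.0, "i5-3320M"),
    (50.0, 100.0, 45.0, "2469-2TU"),
    (300.0, 100.0, 20.0, "Win7"),
    (323.0, 100.2, 25.0, "Pro64"),
]
print(fragments_to_rows(sample))
```

Both thresholds depend on the font size and page scale, so they would need tuning for each document.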

I even got the little scissors from Zapf Dingbats :)
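Rows like the ones above can then be turned into structured output such as JSON. The following is a rough sketch under a few assumptions: fields are separated by tabs or by runs of two or more spaces, a leading scissors character is recorded as a plain "marked" flag (its real meaning in the source PDF is not known here), and no column names are invented because the table header is not shown. The function name rows_to_json is hypothetical:

```python
import json
import re
from typing import List


def rows_to_json(rows: List[str], table_name: str = "table") -> str:
    """Serialize delimited text rows as one JSON object per table."""
    records = []
    for row in rows:
        # "\u2702" is the scissors dingbat; treating it as a flag is an
        # assumption made for this sketch.
        marked = row.startswith("\u2702")
        if marked:
            row = row[1:]
        # Split on tabs or on runs of two or more whitespace characters,
        # so values such as 'Win7 Pro64' stay in one field.
        fields = [f for f in re.split(r"\t+|\s{2,}", row.strip()) if f]
        records.append({"marked": marked, "fields": fields})
    return json.dumps({table_name: records}, ensure_ascii=False, indent=2)


example_rows = [
    "\u2702 2469-2SU  i5-3210M    4GBx1   Win7 Pro64",
    "2469-32U\ti5-3230M\t4GBx1\tWin7 Pro64",
]
print(rows_to_json(example_rows, "models"))
```

Keeping the values as an ordered list avoids guessing at column names; once the real header row is extracted, each list could be zipped with it to produce keyed objects.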

+1

Source: https://habr.com/ru/post/1533232/

