How to extract table data from PDF as CSV from command line?

I want to extract all the rows from this PDF, ignoring the column headers as well as the page headers (i.e. Supported Devices).

 pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
   | sed '$d' \
   | sed -r 's/ +/,/g; s/ //g' \
   > output.csv

The resulting file should be in CSV table format (value fields separated by commas).

In other words, I want to improve the above command so that the output does not break at all. Any ideas?

+6
3 answers

I also offer you another solution.

While the pdftotext method works with reasonable effort in this case, there may be cases where not every page has the same column widths (your PDF is rather benign in this respect).

The not-so-well-known, but pretty cool Free and Open Source Software Tabula-Extractor is the best choice here.

I myself use a direct checkout from GitHub:

 $ cd $HOME ; mkdir svn-stuff ; cd svn-stuff
 $ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor

I wrote myself a pretty simple shell script like this:

 $ cat ~/bin/tabulaextr
 #!/bin/bash
 cd "${HOME}/svn-stuff/git.tabula-extractor/bin"
 ./tabula "$@"

Since ~/bin/ is in my $PATH, I just run

 $ tabulaextr --pages all \
     $(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
   | tee my.csv

to extract all tables from all pages and convert them to a single CSV file.

The first ten (out of 8727) CSV lines look like this:

 $ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv
 Retail Branding,Marketing Name,Device,Model
 "","",AD681H,Smartfren Andromax AD681H
 "","",FJL21,FJL21
 "","",Luno,Luno
 "","",T31,Panasonic T31
 "","",hws7721g,MediaPad 7 Youth 2
 3Q,OC1020A,OC1020A,OC1020A
 7Eleven,IN265,IN265,IN265
 AOI ELECTRONICS FACTORY,AOI,TR10CS1_11,TR10CS1
 AG Mobile,Status,Status,Status

which look like this in the original PDF:

Screenshot from top of first page of sample PDF

It even got these lines on the last page, 293, right:

 nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A
 nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A

which look on the PDF page as follows:

last page of sample PDF

TabulaPDF and Tabula-Extractor are really, really cool for such tasks!


Update

Here is an asciinema screencast (which you can also download and replay locally in your Linux / macOS / Unix terminal with the asciinema command-line tool), starring Tabula-Extractor:

asciicast

+11

What you want is pretty simple, but you also have another problem (I'm not sure you are aware of it...).

First, you should add -nopgbrk ("No page breaks, please!") to your command. That way the annoying ^L characters, which otherwise appear in the output, do not need to be filtered out later.

Adding grep -vE '(Supported Devices|^$|Marketing Name)' then filters out all the lines you don't want: page headers, column headers and empty lines (note that ^$ matches only truly empty lines, not lines consisting of spaces):

 pdftotext -layout -nopgbrk \
     DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
   | grep -vE '(Supported Devices|^$|Marketing Name)' \
   | gsed '$d' \
   | gsed -r 's# +#,#g' \
   | gsed '# ##g' \
   > output2.csv
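To try the filtering step without the PDF at hand, here is a small sketch that runs a similar filter on made-up text. The function name filter_rows and the sample rows are hypothetical; the pattern ^ *$ (instead of ^$) and the tr -d '\f' step are my additions, so that lines containing only spaces and stray form feeds are dropped as well. Keep in mind that a data row which happens to contain "Supported Devices" or "Marketing Name" would be dropped too.

```shell
# filter_rows: read pdftotext-style text on stdin, strip form feeds (^L)
# and drop page headers, column headers and blank or space-only lines.
filter_rows() {
    tr -d '\f' | grep -vE '(Supported Devices|Marketing Name|^ *$)'
}

# Made-up sample input imitating one page of `pdftotext -layout` output.
printf 'Supported Devices\n\nRetail Branding  Marketing Name  Device  Model\n   \n3Q  OC1020A  OC1020A  OC1020A\n\f7Eleven  IN265  IN265  IN265\n' \
  | filter_rows
```

Only the two data rows should remain in the output.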

However, your other problem is this:

  • Some table fields are empty.
  • Empty fields appear with the -layout option as a series of spaces, sometimes even two on the same line.
  • However, text columns are not equally spaced from page to page.
  • Therefore, you cannot tell from line to line how many spaces you need to treat as an "empty CSV field" (where you need an extra , separator).
  • As a result, your current code will only show one, two or three (instead of four) fields for some rows, and these fields fall into the wrong columns!
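The effect described in this list can be reproduced with two made-up rows. This is only an illustration: the function name collapse_spaces and the sample data are hypothetical, and the pattern (two or more spaces become one comma) is an assumption about the naive approach, not necessarily the exact expression used in the question.

```shell
# collapse_spaces: turn every run of two or more spaces into a single comma.
collapse_spaces() {
    sed -E 's/ {2,}/,/g'
}

# Two made-up rows; in the second one the first two fields are empty,
# so -layout renders them as one long run of spaces.
printf '%s\n' \
    'Acme    Widget   W1   W-1' \
    '                 W2   W-2' \
  | collapse_spaces \
  | awk -F, '{ print NF " fields: " $0 }'
```

The first row comes out with four fields, the second with only three (,W2,W-2), so W2 and W-2 land in the wrong columns.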

There is a workaround for this:

  • Add the -x ... -y ... -W ... -H ... options to pdftotext to crop the PDF column by column.
  • Then join the columns again with a combination of utilities like paste and column.

The following command extracts the first column:

 pdftotext -layout -x 38 -y 77 -W 176 -H 500 \
     DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 1st-columns.txt

These are for the second, third and fourth columns:

 pdftotext -layout -x 214 -y 77 -W 176 -H 500 \
     DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 2nd-columns.txt
 pdftotext -layout -x 390 -y 77 -W 176 -H 500 \
     DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 3rd-columns.txt
 pdftotext -layout -x 567 -y 77 -W 176 -H 500 \
     DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 4th-columns.txt
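The four invocations differ only in the -x offset, so they could also be written as a loop. This is just a sketch: the function name crop_columns, the output names column-N.txt and the PDFTOTEXT override variable (handy for a dry run) are hypothetical; the coordinates are the ones used above and fit this particular PDF only.

```shell
# crop_columns PDF: run pdftotext once per column, using the x offsets
# from the commands above, and write column-1.txt ... column-4.txt.
# Set PDFTOTEXT=echo to see the arguments instead of running pdftotext.
crop_columns() {
    pdf=$1
    n=1
    for x in 38 214 390 567; do
        "${PDFTOTEXT:-pdftotext}" -layout -x "$x" -y 77 -W 176 -H 500 \
            "$pdf" - > "column-$n.txt"
        n=$((n + 1))
    done
}
```

Usage would be crop_columns DAC06E7D1302B790429AF6E84696FCFAB20B.pdf, run in the directory where the column files should end up.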

By the way, I cheated a little: to find out what values to use for -x, -y, -W and -H, I first ran this command to find the exact coordinates of the column header words:

 pdftotext -f 1 -l 1 -layout -bbox \
     DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | head -n 10

It is always good to know how to read and make use of pdftotext -h.
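With -bbox, pdftotext prints one <word xMin="..." yMin="..." xMax="..." yMax="...">text</word> element per word. A small sed filter can reduce that to the numbers that matter for -x and -y. This is a sketch that assumes exactly that attribute order and one word element per line; the function name word_boxes and the sample line in the comment are made up.

```shell
# word_boxes: read `pdftotext -bbox` output on stdin and print
# "xMin yMin word" for every <word> element, one per line.
word_boxes() {
    sed -n 's/.*<word xMin="\([^"]*\)" yMin="\([^"]*\)" xMax="[^"]*" yMax="[^"]*">\(.*\)<\/word>.*/\1 \2 \3/p'
}

# Example with a made-up input line:
#   printf '<word xMin="38.0" yMin="77.0" xMax="70.0" yMax="88.0">Retail</word>\n' | word_boxes
```

On the real file this could be used as pdftotext -f 1 -l 1 -bbox DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | word_boxes | head.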

In any case, how to join the four text files as columns side by side, with the proper CSV separator between them, is something you should figure out for yourself. Or ask a new question :-)
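As a starting point for that last step, one possible sketch uses paste. It assumes that every cropped file contains exactly one line per table row (so the rows line up) and that no field contains a comma or a double quote, since nothing is quoted here. The function name join_columns is hypothetical.

```shell
# join_columns FILE...: put the given one-column text files side by side,
# separated by commas, and trim the padding spaces around every field.
join_columns() {
    paste -d ',' "$@" \
      | sed -e 's/ *, */,/g' -e 's/^ *//' -e 's/ *$//'
}
```

For example: join_columns 1st-columns.txt 2nd-columns.txt 3rd-columns.txt 4th-columns.txt > output.csv. An empty line in one of the files becomes an empty CSV field, which is exactly what collapsing spaces in the full-width text cannot give you.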

+3

As Martin R. commented, tabula-java is the new version of tabula-extractor and is actively maintained. Version 1.0.0 was released on July 21, 2017.

Download the jar file and run it with a recent Java:

 java -jar ./tabula-1.0.0-jar-with-dependencies.jar \
     --pages=all \
     ./DAC06E7D1302B790429AF6E84696FCFAB20B.pdf > support_devices.csv
0

Source: https://habr.com/ru/post/987440/