Extract tables from PDF files in Ruby

Question

What is the best way to extract tables embedded in PDF documents?

I'm not looking for solutions that only work on JRuby, or that rely on third-party APIs or websites.

Can you share some Ruby code showing how to extract the tables? Which gems are best suited for the job?

I'm sure someone has had the same problem before :) I appreciate your help!

+5

ruby ruby-on-rails

Tilo Jan 28 '17 at 19:16

3 answers

See this answer: "How to convert PDF to Excel or CSV in Rails 4". It addresses the same problem you are trying to solve.

+3

Damian simon peter Jan 29 '17 at 19:03

Check out the pdf-reader gem; I think it is what you are looking for.

+2

Zach tuttle Jan 31 '17 at 17:06

Bigron · Accepted Answer · 2017-02-03T05:10:28+0000

You can extract data from a PDF using poppler. Depending on your exact requirements, this may be enough.

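    require 'shellwords'

    # Shells out to poppler's pdftotext; the path is escaped for the shell.
    def extract_to_text(pdf_path)
      command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
      `#{command}`
    end

    # Shells out to poppler's pdftohtml; the path is escaped for the shell.
    def extract_to_html(pdf_path)
      command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
      `#{command}`
    end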

These commands convert the PDF to a text file and an HTML file, respectively, stored in the same location as your PDF file.

On macOS, you can install poppler with Homebrew:

 brew install poppler
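
If you want table rows in Ruby rather than a file on disk, one possible approach is to ask pdftotext to preserve the page layout (its -layout option) and print to stdout (passing - as the output file), then split each line on runs of spaces. This is only a rough sketch: the helper names pdf_to_layout_text and parse_table_rows are made up for illustration, and the "two or more spaces means a column gap" rule is an assumed heuristic that will misfire on tables with empty cells or single-space column gaps.

```ruby
require 'open3'

# Runs poppler's pdftotext with -layout (keeps the physical column layout)
# and returns the extracted text; "-" as the output file means stdout.
# Passing arguments separately avoids going through a shell.
def pdf_to_layout_text(pdf_path)
  stdout, stderr, status = Open3.capture3('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed: #{stderr}" unless status.success?
  stdout
end

# Assumed heuristic: a run of two or more whitespace characters is a column gap.
# Lines that yield fewer than min_columns cells are treated as non-table text.
def parse_table_rows(text, min_columns: 2)
  text.each_line
      .map { |line| line.strip.split(/\s{2,}/) }
      .select { |cells| cells.length >= min_columns }
end
```

Usage would look like parse_table_rows(pdf_to_layout_text('report.pdf')), where report.pdf is a placeholder path. The result is an array of rows, each an array of cell strings, which you can then write out as CSV if needed.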
