Extract tables from PDF files in Ruby

What is the best way to extract tables embedded in PDF documents?

I don't need solutions that work only for JRuby, or that use third-party APIs or websites.

Can you share some Ruby code showing how to retrieve the tables? Which gems are best suited for the job?

I'm sure someone has had the same problem before :) I appreciate your help!

+5
3 answers

You can extract data from a PDF using Poppler's command-line tools. Depending on your exact requirements, this may be enough.

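    require 'shellwords'

    # Runs pdftotext, which writes a .txt file next to the PDF.
    def extract_to_text(pdf_path)
      command = ['pdftotext', Shellwords.escape(pdf_path)].join(' ')
      `#{command}`
    end

    # Runs pdftohtml, which writes HTML output next to the PDF.
    def extract_to_html(pdf_path)
      command = ['pdftohtml', Shellwords.escape(pdf_path)].join(' ')
      `#{command}`
    end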

These commands convert the PDF to a text file and an HTML file, respectively, stored in the same directory as the PDF.
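Since the question is about tables specifically, here is a minimal sketch of one way to go from Poppler's output to rows and columns. It relies on pdftotext's `-layout` option (which keeps the physical layout of the page) and on passing `-` as the output file to print to stdout. The helper names `pdf_layout_text` and `rows_from_layout` are made up for this example, and the rule "two or more spaces separate columns" is an assumption that will not hold for every PDF.

```ruby
require 'open3'

# Hypothetical helper: run pdftotext in layout mode and return the text.
# "-layout" keeps the physical layout; "-" sends the output to stdout.
def pdf_layout_text(pdf_path)
  stdout, status = Open3.capture2('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  stdout
end

# Hypothetical helper: turn layout text into an array of rows.
# Assumption: cells in a row are separated by two or more whitespace characters.
def rows_from_layout(text)
  text.each_line
      .map(&:strip)
      .reject(&:empty?)
      .map { |line| line.split(/\s{2,}/) }
end
```

Typical use would be `rows_from_layout(pdf_layout_text('report.pdf'))`, where `report.pdf` is a placeholder path. Cells that themselves contain wide gaps, or tables with empty cells, will need extra handling.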

You can install Poppler on macOS with Homebrew:

 brew install poppler 
+1

Take a look at this answer: "How to convert PDF to Excel or CSV in Rails 4". It solves the same problem you are trying to solve.

+3

Check out the pdf-reader gem. I think it is what you are looking for.
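For reference, a minimal sketch of pulling text out with pdf-reader, which is a pure-Ruby gem and so does not require JRuby. `PDF::Reader.new`, `#pages` and `page.text` are part of the gem; the helper names `page_lines` and `pdf_reader_lines` are invented for this example. Whether the returned lines keep table columns aligned well enough to split into cells depends on the PDF.

```ruby
# Hypothetical helper: collect the non-empty text lines from every page.
# Works with any object whose #pages yield objects responding to #text.
def page_lines(reader)
  reader.pages
        .flat_map { |page| page.text.lines.map(&:rstrip) }
        .reject(&:empty?)
end

# Hypothetical helper: open a PDF with pdf-reader and return its lines.
# The gem is required lazily, so page_lines itself has no gem dependency.
def pdf_reader_lines(pdf_path)
  require 'pdf/reader'
  page_lines(PDF::Reader.new(pdf_path))
end
```

Install the gem with `gem install pdf-reader`, then call `pdf_reader_lines('report.pdf')` (a placeholder path) and split each line into cells with whatever rule fits your documents.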

+2

Source: https://habr.com/ru/post/1263512/

