Well, after much research, I could not find an API or even suitable software that does this. Here is how I did it.
First I extract the table from the PDF into an HTML table using the pdftables API. It is cheap.
Then I convert the HTML table to CSV.
(This is not perfect, but it works)
Here is the code:
require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty
  base_uri 'http://localhost:3000'

  # Upload the PDF to pdftables and save the HTML response to disk.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })
    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response
    end
  end

  # Parse the HTML saved by run (point this at the same file) and
  # write one CSV row per table row.
  def convert
    f = File.open("/path/to/saved/html/response.html")
    doc = Nokogiri::HTML(f)
    f.close
    # Options are passed as keywords so this also works on Ruby 3.
    csv = CSV.open("path/to/csv/t.csv", 'w', :col_sep => ",", :quote_char => '\'', :force_quotes => true)
    doc.xpath('//table/tr').each do |row|
      tarray = []
      row.xpath('td').each do |cell|
        tarray << cell.text
      end
      csv << tarray
    end
    csv.close
  end
end
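To see what the CSV options in convert actually produce, here is a small sketch that uses only the standard csv library. The helper name quoted_csv_line and the sample cells are made up for illustration; the options are the same ones passed to CSV.open above.

```ruby
require 'csv'

# Hypothetical helper (not part of the answer's class): formats one table row
# with the same options that convert passes to CSV.open.
def quoted_csv_line(cells)
  CSV.generate_line(cells, col_sep: ",", quote_char: "'", force_quotes: true)
end

# Every field is wrapped in single quotes because of force_quotes.
puts quoted_csv_line(["Item", "Price"])

# A single quote inside a cell is doubled; a comma inside a cell is kept
# within the quoted field rather than splitting it.
puts quoted_csv_line(["O'Brien", "1,200"])
```

If whatever reads the CSV expects the usual double-quote character, dropping the quote_char option should be enough.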
Now run it like this:
#> page = PageTextReceiver.new
#> page.run
#> page.convert
This is not refactored; it is just a proof of concept. You should consider performance.
One option is to use the Sidekiq gem to run it in the background and hand the result back to the main thread.
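The sketch below is not Sidekiq (which needs Redis and a worker class); it is a plain-Ruby stand-in, using only Thread and Queue, that shows the same idea of doing the work in the background and handing the result back. The names convert_pdf and convert_in_background are hypothetical, and convert_pdf only fakes the work by returning a path.

```ruby
# Stand-in for the slow work (run followed by convert in the class above).
# Hypothetical: it only returns the path the CSV would be written to.
def convert_pdf(pdf_path)
  pdf_path.sub(/\.pdf\z/, ".csv")
end

# Starts the job on a background thread and returns the thread together
# with the Queue that the result will be pushed onto.
def convert_in_background(pdf_path)
  results = Queue.new
  worker = Thread.new { results << convert_pdf(pdf_path) }
  [worker, results]
end

worker, results = convert_in_background("/path/to/pdf/uploaded_pdf.pdf")
# ... the main thread is free to do other work here ...
csv_path = results.pop # blocks until the worker has pushed its result
worker.join
puts csv_path
```

With Sidekiq the job would run in a separate process instead, so the result would normally come back through the database or a file rather than an in-memory queue.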