How to convert PDF to Excel or CSV in Rails 4

I have searched a lot and have no choice but to ask here. Does anyone know of an online converter with an API, or a gem (or gems), that can convert a PDF to an Excel or CSV file?

I'm not sure if this is the best place to ask about this.

My application is in Rails 4.2. The PDF file contains a header and a large table with approximately 10 columns.

Additional information: The user uploads the PDF through a form, and I then need to convert it to CSV and read the contents. I tried to read the content using the pdf-reader gem, but the result was not promising.

I tried freepdfconvert.com/pdf-excel. Unfortunately, they do not offer an API (I contacted them).

PDF example: [image]

This piece of code converts PDF to text, which is convenient. Gem: pdf-reader

    def self.parse
      reader = PDF::Reader.new("pdf_uploaded_by_user.pdf")
      reader.pages.each do |page|
        puts page.text
      end
    end

Now, if you check the sample PDF attached, you will see that some fields may be empty. This means I can't just split each line of text on spaces and put the pieces in an array, because the array elements would no longer line up with the correct fields.
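To make the empty-field problem concrete, here is a minimal sketch in plain Ruby. It assumes the extracted text keeps the columns aligned at fixed character offsets, which pdf-reader does not guarantee. The helpers `column_ranges` and `slice_row` and the sample lines are hypothetical and not taken from the real PDF.

```ruby
# Hypothetical helper: find where each column title starts in the header line.
# Titles are assumed to be separated by two or more spaces.
def column_ranges(header_line)
  starts = []
  header_line.scan(/\S+(?: \S+)*/) { starts << Regexp.last_match.begin(0) }
  starts.each_with_index.map do |start, i|
    stop = starts[i + 1] ? starts[i + 1] - 1 : -1
    (start..stop)
  end
end

# Hypothetical helper: cut a row at the column offsets, so an empty cell
# comes back as "" instead of disappearing.
def slice_row(line, ranges)
  ranges.map { |range| (line[range] || "").strip }
end

# Sample data, padded to 10-character columns (an assumption for illustration).
header = format("%-10s%-10s%s", "Name", "City", "Amount")
rows = [
  format("%-10s%-10s%s", "Alice", "Paris", "10"),
  format("%-10s%-10s%s", "Bob", "", "25") # City is empty in this row
]

naive = rows[1].split # => ["Bob", "25"], the empty City field is lost

ranges = column_ranges(header)
parsed = rows.map { |line| slice_row(line, ranges) }
# parsed[1] => ["Bob", "", "25"]
```

Slicing by offset only helps if the text layout really is aligned; if column positions drift between pages, this approach breaks down as well.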

Thanks.

3 answers

Well, after much research, I could not find an API or even suitable software that does this. Here is how I did it.

First I extract the table from the PDF as an HTML table using the pdftables API. It is cheap.

Then I convert the HTML table to CSV.

(This is not perfect, but it works)

Here is the code:

    require 'httmultiparty'
    require 'nokogiri'
    require 'csv'

    class PageTextReceiver
      include HTTMultiParty
      base_uri 'http://localhost:3000'

      # Upload the PDF to pdftables and save the HTML response to disk.
      def run
        response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey',
          :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })

        File.open('/path/to/save/as/html/response.html', 'w') do |f|
          f.puts response
        end
      end

      # Read the saved HTML and write each table row as one CSV line.
      def convert
        doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }

        # Options are passed as keywords so this also works on newer Rubies.
        CSV.open("path/to/csv/t.csv", 'w',
                 :col_sep => ",", :quote_char => '\'', :force_quotes => true) do |csv|
          doc.xpath('//table/tr').each do |row|
            tarray = []
            row.xpath('td').each do |cell|
              tarray << cell.text
            end
            csv << tarray
          end
        end
      end
    end

Now run it like this:

    #> page = PageTextReceiver.new
    #> page.run
    #> page.convert

This is not refactored; it is just a proof of concept. You should consider performance.

I could use the Sidekiq gem to run it in the background and pass the result back to the main thread.
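As a small check of the CSV options used in `convert`, the sketch below writes a row with a blank cell and reads it back, using only Ruby's `csv` library. The sample rows are made up for illustration; the point is that with `force_quotes` the blank cell is written as an empty quoted field, so the columns stay aligned.

```ruby
require 'csv'

# Rows as they might come out of the table cells; "" stands for a blank cell.
# This is sample data, not taken from the original PDF.
rows = [
  ["Name", "City", "Amount"],
  ["Bob", "", "25"]
]

# Same options as in the convert method above.
options = { col_sep: ",", quote_char: "'", force_quotes: true }

csv_text = CSV.generate(**options) do |csv|
  rows.each { |row| csv << row }
end

# Reading the data back requires the same quote character.
round_trip = CSV.parse(csv_text, col_sep: ",", quote_char: "'")
```

Note that whoever consumes the file has to know about the single-quote `quote_char`; most CSV readers default to double quotes.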


Ryan Bates covers CSV exports in his RailsCasts episode: http://railscasts.com/episodes/362-exporting-csv-and-excel . This may give you some pointers.

Edit: since you now specify that you need the raw data from the uploaded PDF file, you can use JavaScript to read the PDF and feed the data into Ryan Bates' export method. Reading PDFs was covered well in the following question:

extract text from pdf in javascript

I would suggest that the flow would be something like this:

    PDF new action
        user uploads PDF
    PDF show action
        PDF is displayed
        JavaScript reads PDF
        JavaScript populates Ryan's raw data
        Raw data is exported with PDF data included

Check out the Tabula-Extractor project and see how it is used in projects such as the NYPD Moving Summonses Parser and the CompStat Criminal Complaint Compiler.


Source: https://habr.com/ru/post/1263513/

