Heroku and web scraping

Question

I have a Nokogiri scraper that populates a database, and I am trying to deploy it to Heroku. I also have a Sinatra application that acts as the front end and reads from that database. I'm new to Heroku and web development, and I don't know the best way to handle something like this.

Should I put the scraper script that loads the database behind a Sinatra route (e.g. mywebsite.com/scraper) and just make the route so obscure that no one visits it? Eventually, I would like part of the Sinatra app to be a REST API that pulls from the database.

Thanks for all the input.

+6

ruby api web-services heroku sinatra

John lamburger Jul 12 '13 at 0:40

2 answers

Xlii · Answer 1 · 2013-08-06T11:28:32+0000

There are two approaches you can take.

The first is to use one-off dynos, running the scraper from the console with heroku run YOURCMD. Just make sure the scraper writes its results to the database and not to disk: a dyno's filesystem is ephemeral, so files written during the run are discarded when the dyno exits.

Additional information: https://devcenter.heroku.com/articles/one-off-dynos

The second is to separate the scraper from the web process: you have a web process for the usual front-end interaction and a scraper process that the web process can spawn or talk to. If you go this route, it is up to you to protect it from the rest of the world (auth, an obfuscated URL, etc.).

Additional information: https://devcenter.heroku.com/articles/background-jobs-queueing
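If the scraper is triggered over HTTP, an obscure URL alone is weak protection; a shared secret kept in a config var is one simple alternative. Below is a minimal sketch using only the standard library. The helper name authorized? and the SCRAPER_TOKEN variable are assumptions made for illustration, not something from the question or from Heroku.

```ruby
require 'digest'

# Hypothetical helper: checks a caller-supplied token against the expected
# secret (by default read from the assumed SCRAPER_TOKEN config var).
# Both values are hashed first, so the comparison always runs over two
# equal-length strings and does not stop at the first differing byte.
def authorized?(provided, expected = ENV['SCRAPER_TOKEN'])
  return false if provided.nil? || expected.nil? || expected.empty?

  a = Digest::SHA256.hexdigest(provided)
  b = Digest::SHA256.hexdigest(expected)

  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end
```

Inside a Sinatra route this could be used as something like halt 403 unless authorized?(params['token']) before the scraper is kicked off, with the secret set through heroku config:set.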

user706001 · Answer 2 · 2014-05-14T10:47:13+0000

I did this by creating a rake task and running it in a one-off dyno, as mentioned by Xlii.

Here is my rake task file:

    require 'bundler/setup'
    Bundler.require

    desc "Scrape Site"
    task :scrape, [:companyname] => :environment do |t, args|
      puts "Company Name is :" + args[:companyname]
      agent = Mechanize.new
      agent.user_agent_alias = 'Mac Safari'
      puts "Agent (Mac Safari Created)"
      # MORE SCRAPING CODE
    end

You can then run it on demand:

 heroku run rake scrape[google]
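One thing to watch for: the :environment prerequisite is a task that Rails defines, so in a plain Sinatra app it exists only if you or one of your libraries define it. The sketch below shows the same kind of task without that prerequisite and how the bracketed argument reaches the block. scrape_company is a hypothetical stand-in for the real scraping code, and Rake::Task.define_task is used only so that the snippet runs as a plain script; in a Rakefile you would write task :scrape, [:companyname] do |t, args| ... end instead.

```ruby
require 'rake'

# Hypothetical stand-in for the real scraping code. It only validates and
# echoes the argument, so the example runs without network access.
def scrape_company(name)
  company = name.to_s.strip
  raise ArgumentError, 'company name is required' if company.empty?

  "scraping #{company}"
end

# No :environment prerequisite here, so the task does not depend on Rails.
# The value inside the brackets of scrape[google] arrives as args[:companyname].
Rake::Task.define_task(:scrape, [:companyname]) do |_task, args|
  puts scrape_company(args[:companyname])
end

# Equivalent of running `rake scrape[google]` from the command line.
Rake::Task['scrape'].invoke('google')
```

If your shell treats square brackets specially (zsh does), quote the task name: heroku run "rake scrape[google]".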
