Heroku and web scraping

I have a Nokogiri scraper that populates a database, and I am trying to deploy it to Heroku. I also have a Sinatra application that I want to read from that database. I'm new to Heroku and web development, and I don't know the best way to handle something like this.

Should I put the scraper script that loads the database behind a Sinatra route (e.g. mywebsite.com/scraper) and just make the URL so obscure that no one visits it? Eventually, I would like part of the Sinatra app to be a REST API that pulls from the database.

Thanks for all the input.

2 answers

There are two approaches you can take.

The first is to use one-off dynos, running the scraper from the console with heroku run YOURCMD. Just make sure the scraper does not write its results to disk, since a dyno's filesystem is ephemeral, but stores them in the database instead.

Additional information: https://devcenter.heroku.com/articles/one-off-dynos
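To illustrate the "use the database, not the disk" point, here is a minimal sketch of how a scraper script run in a one-off dyno might find its database. It assumes the connection string is exposed in an environment variable such as DATABASE_URL (as the Heroku Postgres add-on does); the helper name database_config is made up for this example, and the actual connection and insert calls depend on whichever database library you use.

```ruby
require 'uri'

# Hypothetical helper: turn a connection URL (for example the value of
# ENV['DATABASE_URL']) into a plain hash of connection settings.
def database_config(url)
  uri = URI.parse(url)
  {
    adapter:  uri.scheme,
    host:     uri.host,
    port:     uri.port,
    user:     uri.user,
    password: uri.password,
    database: uri.path.sub(%r{\A/}, '') # drop the leading slash
  }
end

# In the scraper script you would then do something along these lines:
#   config = database_config(ENV.fetch('DATABASE_URL'))
#   ...open a connection with your database library and insert the
#   scraped rows there instead of writing files to disk...
```

The script itself would be started with heroku run, like any other one-off command.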

The second is to separate the scraper from the web process, so that you have a web process for the usual user-facing interaction and a scraper process that the web process can spawn or talk to. If you go this route, it is up to you to protect it from the rest of the world (auth, URL obfuscation, etc.).

Additional information: https://devcenter.heroku.com/articles/background-jobs-queueing
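If the web process exposes any way to trigger the scraper, an obscure URL alone is weak protection. As a rough sketch of the "auth" option, the check below compares a bearer token from the request against a shared secret. The names authorized? and secure_equal?, and the SCRAPER_TOKEN variable mentioned in the comments, are assumptions made for this example and not part of Sinatra or Heroku.

```ruby
require 'digest'

# Compare two strings without revealing, through timing, how many leading
# characters match. Hashing first gives both sides the same length.
def secure_equal?(a, b)
  left  = Digest::SHA256.digest(a).bytes
  right = Digest::SHA256.digest(b).bytes
  diff = 0
  left.each_with_index { |byte, i| diff |= byte ^ right[i] }
  diff.zero?
end

# header: value of the Authorization header, e.g. "Bearer abc123" (may be nil).
# secret: the expected token, e.g. ENV['SCRAPER_TOKEN'] (hypothetical name).
def authorized?(header, secret)
  # Refuse everything if no secret is configured.
  return false if secret.nil? || secret.empty?
  return false unless header.is_a?(String)

  scheme, token = header.split(' ', 2)
  return false unless scheme == 'Bearer' && token

  secure_equal?(token, secret)
end
```

In a Sinatra route you could call authorized?(request.env['HTTP_AUTHORIZATION'], ENV['SCRAPER_TOKEN']) and halt with a 401 when it returns false, before handing the work to the scraper process.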


I did this by creating a rake task and using the one-off dynos mentioned by XLII.

Here is my rake task file:

    require 'bundler/setup'
    Bundler.require

    desc "Scrape Site"
    task :scrape, [:companyname] => :environment do |t, args|
      puts "Company Name is :" + args[:companyname]
      agent = Mechanize.new
      agent.user_agent_alias = 'Mac Safari'
      puts "Agent (Mac Safari Created)"
      # MORE SCRAPING CODE
    end

You can then run it on demand:

    heroku run rake scrape[google]

Source: https://habr.com/ru/post/949277/

