Anemone Ruby with focus_crawl

I have a crawl working, but before I crawl an entire site I would like to run a test on, say, a single page first. So I thought something like the code below would work, but I keep getting an error ...

Anemone.crawl(self.url) do |anemone|
  anemone.focus_crawl do |crawled_page|
    crawled_page.links.slice(0..10)
    page = pages.find_or_create_by_url(crawled_page.url)
    logger.debug(page.inspect)
    page.check_for_term(self.term, crawled_page.body)
  end
end

NoMethodError (private method `select' called for true:TrueClass):
    app/models/site.rb:14:in `crawl'
    app/controllers/sites_controller.rb:96:in `block in crawl'
    app/controllers/sites_controller.rb:95:in `crawl'

Basically, I want a way to crawl only 10 pages first, but I don't seem to understand the basics here. Can anybody help me? Thanks!!
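For context, Anemone's focus_crawl block is expected to return the array of links the crawler should follow; in the snippet above the block's last expression is the result of check_for_term (presumably a boolean), and Anemone then calls select on that value, which is likely what produces the NoMethodError. A minimal sketch of the intended shape, keeping the question's pages and check_for_term helpers as-is:

Anemone.crawl(self.url) do |anemone|
  anemone.focus_crawl do |crawled_page|
    page = pages.find_or_create_by_url(crawled_page.url)
    logger.debug(page.inspect)
    page.check_for_term(self.term, crawled_page.body)
    # focus_crawl treats the block's return value as the list of links
    # to follow, so the sliced links array must be the last expression
    crawled_page.links.slice(0..10)
  end
end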

+3
3 answers

Add this monkey patch to your crawl file.

module Anemone
  class Core
    # Stop the crawl by killing all of Anemone's worker threads (tentacles)
    def kill_threads
      @tentacles.each do |thread|
        Thread.kill(thread) if thread.alive?
      end
    end
  end
end

Here is an example of how to use it once you have added it to your crawl file: call it from inside your anemone.on_every_page block.

@counter = 0
Anemone.crawl("http://stackoverflow.com", :obey_robots => true) do |anemone|
  anemone.on_every_page do |page|
    @counter += 1
    # Stop the crawl once more than 10 pages have been processed
    if @counter > 10
      anemone.kill_threads
    end
  end
end

Reference: https://github.com/chriskite/anemone/issues/24

+1

You could also use the :depth_limit option, which limits how many link levels deep Anemone will follow from the start page.
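For reference, a minimal sketch of that option, using the question's self.url (:depth_limit takes an integer number of link levels to follow):

# Follow links at most one level away from the start page,
# which keeps a test run small without patching Anemone.
Anemone.crawl(self.url, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    logger.debug(page.url)
  end
end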

0

I found your question while I was looking into Anemone.

I had the same problem, and this is what I did with Anemone:

As soon as I reach the limit of URLs I want, I raise an exception. The entire Anemone block is wrapped in a begin/rescue block.
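A minimal sketch of that pattern (the StopCrawl exception name and the 10-page cutoff are illustrative, not from the original answer):

class StopCrawl < StandardError; end

visited = 0
begin
  Anemone.crawl(self.url) do |anemone|
    anemone.on_every_page do |page|
      visited += 1
      # Abort the crawl once enough pages have been seen
      raise StopCrawl if visited >= 10
    end
  end
rescue StopCrawl
  # Expected: the crawl was deliberately cut short
end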

In your particular case, though, I would take a different approach: download the page you want to analyze, save it locally, and serve it with fakeweb. I wrote a blog post about this a while ago that might be useful: http://blog.bigrails.com/scraper-guide.html
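A minimal sketch of that idea (the URL and fixture path are placeholders; FakeWeb.register_uri and allow_net_connect are standard FakeWeb calls, and Anemone goes through Net::HTTP, which FakeWeb intercepts):

require 'anemone'
require 'fakeweb'

# Serve a locally saved copy of the page instead of hitting the network,
# so the crawl code can be exercised against a single known page.
FakeWeb.allow_net_connect = false
FakeWeb.register_uri(:get, "http://example.com/test-page",
                     :body => File.read("spec/fixtures/test_page.html"))

Anemone.crawl("http://example.com/test-page") do |anemone|
  anemone.on_every_page do |page|
    logger.debug(page.body[0, 100])
  end
end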

0