Scrapy crawling URL order with a long list of start_urls and spider-yielded URLs

Help! Reading the Scrapy source code is not easy for me. I have a very long start_urls list; it contains about 3,000,000 URLs per file. So I build start_urls as follows:

    import codecs

    def read_urls_from_file(file_path):
        with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
            for line in f:
                try:
                    url = line.strip()
                    yield url
                except:
                    print u"read line:%s from file failed!" % line
                    continue
        print u"file read finish!"

    start_urls = read_urls_from_file(u"XXXX")

Meanwhile, my spider callback functions are as follows:

    from scrapy.http import Request

    def parse(self, response):
        self.log("Visited %s" % response.url)
        return Request(url="http://www.baidu.com", callback=self.just_test1)

    def just_test1(self, response):
        self.log("Visited %s" % response.url)
        return Request(url="http://www.163.com", callback=self.just_test2)

    def just_test2(self, response):
        self.log("Visited %s" % response.url)
        return []

my questions:

  • In what order are the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all start_urls are used? (I did some tests, and it seems the answer is No.)
  • What decides the order? Why is it this order, and how can we control it?
  • Is this a good way to deal with so many URLs that are already in a file? What else could I do?

Thank you very much!!!

Thanks for the answers. But I'm still a bit confused: by default, Scrapy uses a LIFO queue to store pending requests.

  • The Request objects created by the spider's callback functions are passed to the scheduler. Who does the same with the start_url requests? The spider's start_requests() function only generates an iterator without providing the real requests.
  • Will all requests (the start_url ones and the callbacks' ones) end up in the same request queue? How many queues are there in Scrapy?
1 answer

First of all, see this topic; I think you will find answers to all your questions there.

In what order are the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all start_urls are used? (I did some tests, and it seems the answer is No.)

You are right, the answer is No. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

What decides the order? Why is it this order, and how can we control it?

By default, there is no predetermined order: you cannot know when a Request produced by make_requests_from_url will arrive, because everything is asynchronous.

See this answer for how you can control the order. In short, you can override start_requests and assign each Request a priority (for example, yield Request(url, priority=0)); requests with a higher priority value are scheduled earlier. The priority value could, for example, be derived from the line number where the URL appears.
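For illustration, here is a minimal, hypothetical sketch (not part of the original answer) showing how the priority argument of Request affects scheduling; the spider name and the example.com URLs are placeholders:

    import scrapy

    class PriorityDemoSpider(scrapy.Spider):
        # Hypothetical spider, only to illustrate request priorities.
        name = "priority_demo"

        def start_requests(self):
            # The scheduler dequeues requests with a higher priority value
            # first, so /first should be downloaded before /last even
            # though it is yielded second.
            yield scrapy.Request("http://example.com/last", priority=0)
            yield scrapy.Request("http://example.com/first", priority=10)

        def parse(self, response):
            self.log("Visited %s" % response.url)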

Is this a good way to handle so many URLs that are already in a file? What else?

I think you should read your file and yield the URLs directly in the start_requests method: see this answer.

So you should do something like this:

    # requires: import codecs; from scrapy.http import Request
    def start_requests(self):
        with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
            for index, line in enumerate(f):
                try:
                    url = line.strip()
                except:
                    continue
                # Higher priority values are dequeued first, so negate the
                # line number to preserve the original file order.
                yield Request(url, priority=-index)
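Regarding the follow-up question about queues: requests coming from start_requests and from the callbacks all go to the same scheduler, which by default keeps pending requests in priority buckets backed by LIFO queues (depth-first behavior). If you prefer breadth-first (FIFO) order, recent Scrapy versions let you switch the queue classes in your settings; the setting names and class paths below assume a reasonably current Scrapy release, so check your version's FAQ before relying on them:

    # settings.py (assuming a recent Scrapy version)
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'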

Hope this helps.


Source: https://habr.com/ru/post/1483961/

