First of all, see this thread; I think it answers all of your questions.
The order of the URLs used by the downloader? Will the requests made by just_test1 and just_test2 be used by the downloader only after all start_urls have been used? (I ran some tests, and the answer seems to be no.)
You are right, the answer is no. The behavior is completely asynchronous: when the spider starts, the start_requests method is called (source):
    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)
What determines the order? Why is it this order, and how is it decided? How can we control it?
By default there is no predetermined order: you cannot know when the responses to the Requests created by make_requests_from_url will arrive, because downloading is asynchronous.
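See this answer for how to control the order. In short, you can override start_requests and mark the yielded Requests with a priority key in meta (for example, yield Request(url, meta={'priority': 0})). The priority value can be, for example, the line number on which the URL was found. Note that a meta key is only a label that travels with the request and comes back on the response; it lets you restore the order afterwards, but it does not change when Scrapy downloads each URL. Request also accepts a separate priority argument, which the scheduler does take into account.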
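To make the idea concrete, here is a minimal sketch of why arrival order differs from request order, and how a per-request tag lets you put the results back in order. It deliberately uses asyncio from the standard library rather than Scrapy (which is built on Twisted), so fake_fetch, crawl and the fixed delays are hypothetical stand-ins, not Scrapy APIs.

```python
import asyncio


async def fake_fetch(url, priority, delay):
    # Hypothetical stand-in for a download: the delay plays the role of network latency.
    await asyncio.sleep(delay)
    # The tag comes back together with the result, much like meta on a response.
    return {"url": url, "priority": priority}


async def crawl(urls, delays):
    # Start every "request" at once, tagging each with its position in the list.
    tasks = [
        asyncio.ensure_future(fake_fetch(url, index, delay))
        for index, (url, delay) in enumerate(zip(urls, delays))
    ]
    arrived = []
    # as_completed hands results back in completion order, not in start order.
    for finished in asyncio.as_completed(tasks):
        arrived.append(await finished)
    return arrived


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
delays = [0.15, 0.05, 0.10]  # assumed latencies; the first URL is the slowest

arrived = asyncio.run(crawl(urls, delays))

# Restore the original order by sorting on the tag.
restored = sorted(arrived, key=lambda item: item["priority"])
print([item["url"] for item in restored])
```

In a real spider the same tag would be available in the callback as response.meta['priority'], and the sorting could happen wherever the scraped items are collected.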
Is this a good way to deal with so many URLs that are already in a file? What else could I do?
I think you should read the file and yield the URLs directly in the start_requests method; see this answer.
So you should do something like this:
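    import codecs

    from scrapy import Request

    def start_requests(self):
        # Read the file lazily, one line at a time, instead of filling start_urls.
        with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
            for index, line in enumerate(f):
                url = line.strip()
                try:
                    # Tag each request with the line number it came from.
                    request = Request(url, meta={'priority': index})
                except ValueError:
                    # Request rejects malformed URLs (e.g. a missing scheme); skip them.
                    continue
                yield request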
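If you want to check the file-reading part without running a crawl, one option is to pull it out into a plain generator. The sketch below is only an illustration: iter_urls is a hypothetical helper name, the http/https check is my own assumption about what counts as a valid line, and it uses the built-in open instead of codecs.open, which behaves the same way for this purpose on Python 3.

```python
import os
import tempfile
from urllib.parse import urlparse


def iter_urls(file_path, encoding="GB18030"):
    """Lazily yield (line_number, url) pairs, skipping blank or malformed lines."""
    with open(file_path, "r", encoding=encoding) as f:
        for index, line in enumerate(f):
            url = line.strip()
            parts = urlparse(url)
            # Assumption: only absolute http(s) URLs are worth requesting.
            if parts.scheme in ("http", "https") and parts.netloc:
                yield index, url


# Usage: write a small sample file, then read it back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="GB18030"
) as tmp:
    tmp.write("http://example.com/a\n\nnot a url\nhttps://example.com/b\n")
    sample_path = tmp.name

pairs = list(iter_urls(sample_path))
os.remove(sample_path)
print(pairs)
```

Inside the spider, start_requests would then reduce to looping over iter_urls(self.file_path) and yielding Request(url, meta={'priority': index}) for each pair.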
Hope this helps.