If we look at the official documentation, the process is simple: first your start URL is parsed, and then each subsequent link is extracted and crawled according to the Rules you provide.
Now, to your first question:

"Are the links extracted by the first LinkExtractor's Rule just scheduled in the scheduler rather than followed immediately? And once all of those links are scheduled, does the spider then call the first_callback method for each of them, passing in the corresponding response?"
The follow argument of a Rule defaults to True if callback is None, and to False otherwise. This means that in your case the crawl will not continue: whatever links are extracted from the response to the start URL are scheduled, and the crawl ends once all of them have been parsed.
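Expressed as code, that default can be sketched like this (effective_follow is an illustrative helper, not part of Scrapy's API):

```python
def effective_follow(callback=None, follow=None):
    # Mimics the documented default of Rule's `follow` argument:
    # when `follow` is not given explicitly, it is True if there is
    # no callback, and False otherwise. (Sketch, not Scrapy source.)
    if follow is None:
        return callback is None
    return follow

print(effective_follow(callback="first_callback"))  # False: crawl stops at these pages
print(effective_follow())                           # True: links are followed by default
print(effective_follow(callback="first_callback", follow=True))  # True: explicit override
```

So a rule that has a callback but no explicit follow=True will parse its pages and go no further.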
If you want the spider to keep following links, split your rules: identify which URLs lead to your content pages and which lead to navigation or listing pages, and give each its own Rule. The navigation rule needs no callback (so it follows by default), while the content rule gets a callback, e.g. extract links matching 'products' (but not matching 'shampoo').
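As a rough, self-contained sketch of that behavior (plain Python rather than Scrapy; the URLs, patterns, and the SITE graph are all made up), matched links are first put in a scheduler, and the "callback" only runs later, when the scheduled page is processed:

```python
import re
from collections import deque

# Toy link graph standing in for a website (hypothetical URLs).
SITE = {
    "/": ["/category/1", "/products/soap"],
    "/category/1": ["/products/soap", "/products/shampoo", "/category/2"],
    "/category/2": ["/products/towel"],
    "/products/soap": [],
    "/products/shampoo": [],
    "/products/towel": [],
}

# Two independent "rules": a navigation rule that only follows, and a
# content rule with a callback. The content pattern extracts links
# matching 'products' (but not matching 'shampoo').
NAV_RULE = re.compile(r"^/category/")
CONTENT_RULE = re.compile(r"^/products/(?!shampoo)")

def crawl(start="/"):
    """Breadth-first sketch of CrawlSpider behavior: matched links are
    scheduled first; the callback runs when the response is processed."""
    parsed = []
    scheduler = deque([start])
    seen = {start}
    while scheduler:
        url = scheduler.popleft()
        if CONTENT_RULE.search(url):
            parsed.append(url)          # "callback" fires for content pages
        for link in SITE.get(url, []):
            if link in seen:
                continue
            # Each rule checks the link independently of the other.
            if NAV_RULE.search(link) or CONTENT_RULE.search(link):
                seen.add(link)
                scheduler.append(link)  # scheduled, not followed immediately
    return parsed

print(crawl())  # ['/products/soap', '/products/towel']
```

Note that '/products/shampoo' is never scheduled (the content pattern rejects it), while the category pages are followed without ever triggering the callback.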
Now, on to your second question:

"Also, when will the second LinkExtractor be called? Is the first LinkExtractor evaluated first, and only then the second LinkExtractor?"
Each rule is independent of the others. Every LinkExtractor object applies its regular-expression or string matching on its own; whenever it finds a matching URL, its Rule executes its callback, follows the link, or both, depending on how it is configured.
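A minimal illustration of that independence, using plain re patterns over a made-up link list (the extractor names and URLs are assumptions, not Scrapy code):

```python
import re

# Two hypothetical allow patterns, one per LinkExtractor.
first_extractor = re.compile(r"/category/")
second_extractor = re.compile(r"/products/")

links = ["/category/1", "/products/soap", "/about"]

# Each extractor filters the same set of links on its own;
# neither waits for, nor depends on, the other.
print([l for l in links if first_extractor.search(l)])   # ['/category/1']
print([l for l in links if second_extractor.search(l)])  # ['/products/soap']
```

There is no ordering between them: both are applied to every response the spider processes.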
Rahul