Using PhantomJS for dynamic content with Scrapy and Selenium: possible race condition

First, this is a follow-up to my earlier question: Change the number of scrapyd spiders launched

I use phantomjs and selenium to create a downloader middleware for my scrapy project. It worked well and did not noticeably slow things down when I launched my spiders one at a time.

More recently, though, I set up a scrapyd server on AWS. I noticed what looks like a race condition that causes errors and performance problems when more than one spider runs at a time. I suspect it comes from two separate issues:

1) Spiders trying to use the phantomjs executable at the same time.

2) Spiders trying to write to the phantomjs ghostdriver log file at the same time.

This is a guess, but the performance problem might come from a spider waiting until resources become available (I suspect this because I also ran into a race condition on the sqlite database).

Here are the errors I get:

exceptions.IOError: [Errno 13] Permission denied: 'ghostdriver.log' (race condition on the log file?)

selenium.common.exceptions.WebDriverException: Message: "Cannot connect to GhostDriver" (race condition on the executable?)

My questions:

Does my analysis of the problem seem correct?

Are there any known solutions to this problem besides limiting the number of spiders that can run at a time?

Is there any other way to handle javascript? (if you think I should create a completely new question to discuss the best way to handle javascript using scrapy, let me know and I will do it)

Here is my downloader middleware:

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        if _platform == "linux" or _platform == "linux2":
            driver = webdriver.PhantomJS(service_log_path='/var/log/scrapyd/ghost.log')
        else:
            driver = webdriver.PhantomJS(executable_path=settings.PHANTOM_JS_PATH)
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))

Note: the _platform check is a temporary workaround until I get this code deployed to a static environment.

I found solutions on SO for the javascript problem, but they were spider-based. This bothered me because it meant that each request had to be made twice: once in the downloader handler and again in the spider. That's why I decided to implement mine as a downloader middleware.

1 answer

Try using scrapy-webdriver to interact with phantomjs: https://github.com/brandicted/scrapy-webdriver
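If you would rather keep your own middleware, here is a minimal sketch of one way to reduce contention between spiders. It assumes the two errors in the question come from several processes sharing one log file and from phantomjs processes that are never shut down; neither assumption is confirmed. The names unique_log_path, managed_driver, render and driver_factory are hypothetical helpers, not part of scrapy or selenium. With the Selenium version used in the question, driver_factory could be webdriver.PhantomJS, which accepts service_log_path.

```python
import os
import tempfile
import uuid
from contextlib import contextmanager


def unique_log_path(log_dir=None, prefix="ghostdriver"):
    """Build a log file path that is unique per process and per driver."""
    log_dir = log_dir or tempfile.gettempdir()
    # The pid separates spider processes; the random suffix separates
    # drivers created within the same process.
    name = "{}-{}-{}.log".format(prefix, os.getpid(), uuid.uuid4().hex[:8])
    return os.path.join(log_dir, name)


@contextmanager
def managed_driver(driver_factory, log_dir=None):
    """Create a driver with its own log file and always quit it afterwards.

    driver_factory is a hypothetical callable that accepts a
    service_log_path keyword and returns an object exposing get(),
    page_source and quit().
    """
    driver = driver_factory(service_log_path=unique_log_path(log_dir))
    try:
        yield driver
    finally:
        # Quit even if the page load raised, so the browser process
        # does not linger and hold on to ports or files.
        driver.quit()


def render(url, driver_factory, log_dir=None):
    """Load url in a fresh driver and return the rendered page source."""
    with managed_driver(driver_factory, log_dir) as driver:
        driver.get(url)
        return driver.page_source
```

Inside process_request, the body could then become something like render(request.url, webdriver.PhantomJS, log_dir='/var/log/scrapyd'), with the result encoded and wrapped in an HtmlResponse as before. The log directory must be writable by the user that scrapyd runs as; if it is not, the Errno 13 would remain regardless of any race.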


Source: https://habr.com/ru/post/972851/

