Captchas in Scrapy

I am working on a Scrapy application where I am trying to log into a site through a form that uses a captcha (this is not for spam). I use the ImagesPipeline to download the captcha, and I print it to the screen so the user can solve it. So far so good.

My question is: how can I resume the spider and submit the solved captcha together with the form data? Right now my spider requests the captcha page and then returns an Item containing the image_url, which is downloaded by the ImagesPipeline and displayed to the user. I don't see how I can resume the spider's run and pass the solved captcha back into the same spider session, since as I understand it the spider has to return the item (i.e. finish) before the ImagesPipeline starts working.

I've looked through the docs and examples, but I haven't found one that makes it clear how to do this.

+6
2 answers

Here's how you can make it work inside the spider.

    self.crawler.engine.pause()
    process_my_captcha()
    self.crawler.engine.unpause()

As soon as you get the captcha response, pause the engine, display the image, read the solution from the user, and resume the crawl by sending a POST request with the login data.
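
For illustration, here's a minimal sketch of what that could look like inside a spider, assuming current Scrapy and Python 3 (the URL, selectors, and form field names are placeholders, not from the question):

    import scrapy
    from scrapy import FormRequest, Request

    class CaptchaLoginSpider(scrapy.Spider):
        name = "captcha_login"
        start_urls = ["http://webpagewithcaptchalogin.com/"]  # placeholder URL

        def parse(self, response):
            # request the captcha image itself, keeping the login page around
            img_url = response.urljoin(response.xpath("//img/@src").get())
            yield Request(img_url, callback=self.solve_captcha,
                          meta={"login_page": response})

        def solve_captcha(self, response):
            self.crawler.engine.pause()           # stop scheduling new requests
            with open("captcha.jpg", "wb") as f:  # show this file to the user
                f.write(response.body)
            captcha = input("put captcha in manually> ")
            self.crawler.engine.unpause()         # resume the crawl
            # POST the login form with the solved captcha
            yield FormRequest.from_response(
                response.meta["login_page"],
                formdata={"user": "xxx", "pass": "xxx", "captcha": captcha},
                callback=self.after_login)

        def after_login(self, response):
            pass  # check whether the login succeeded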

I would be interested to know if this approach works for your case.

+5

I would not create an Item and I would not use the ImagesPipeline.

    import urllib
    import os
    import subprocess

    ...

    def start_requests(self):
        request = Request("http://webpagewithcaptchalogin.com/",
                          callback=self.fill_login_form)
        return [request]

    def fill_login_form(self, response):
        x = HtmlXPathSelector(response)
        img_src = x.select("//img/@src").extract()

        # delete the previous captcha file and use urllib to write the new one to disk
        if os.path.exists(r"c:\captcha.jpg"):
            os.remove(r"c:\captcha.jpg")
        urllib.urlretrieve(img_src[0], r"c:\captcha.jpg")

        # I use a program here to show the jpg (actually send it somewhere)
        captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe")

        # OR just get the input from the user on stdin
        captcha = raw_input("put captcha in manually> ")

        # this request calls process_home_page with the response
        # (this is how you chain pages from start_requests() to parse())
        return [FormRequest.from_response(
            response,
            formnumber=0,
            formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha},
            callback=self.process_home_page)]

    def process_home_page(self, response):
        # check if you logged in etc. etc.

...

What I'm doing here: I use urllib.urlretrieve(url) (to save the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility that solves the captcha). None of the Scrapy infrastructure is used in this "hack", because solving a problem like this is always a hack.
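
If you're on Python 3, the equivalent calls would look like this (a sketch only; the paths and the external solver executable are the same placeholders as above):

    import os
    import subprocess
    from urllib.request import urlretrieve

    if os.path.exists(r"c:\captcha.jpg"):
        os.remove(r"c:\captcha.jpg")                 # delete the previous image
    urlretrieve(img_src[0], r"c:\captcha.jpg")       # save the new one
    captcha = subprocess.check_output(
        [r".\external_utility_solving_captcha.exe"]  # external solver
    ).decode().strip()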

This whole business of calling an external subprocess could be nicer, but it works.

On some sites it is not possible to save the captcha image directly, and you have to open the page in a browser, call a screen-capture utility, and crop at an exact position to "cut out" the captcha. Now that is screen scraping.
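
If you do end up doing that, here's a sketch using Pillow's ImageGrab, assuming you've worked out the on-screen coordinates of the captcha yourself (the bbox values are made up):

    from PIL import ImageGrab  # Pillow

    # left, top, right, bottom of the captcha region on screen (hypothetical)
    bbox = (100, 200, 300, 260)
    ImageGrab.grab(bbox=bbox).save(r"c:\captcha.jpg")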

+3

Source: https://habr.com/ru/post/892460/

