I am new to web scraping and have just started experimenting with Scrapy, a crawling framework written in Python. My goal is to scrape an old Yahoo group, since Yahoo does not provide an API or any other way to extract the message archives. The group is set up so that you must be logged in before you can browse the archives.
I need to follow these steps:
- Log in to Yahoo
- Visit the URL of the first message and scrape it
- Repeat step 2 for the next message, etc.
I started writing a spider to accomplish the above, and here is what I have so far. For now, all I want to see is that the login works and that I can retrieve the first message. I will finish the rest once I get this working:
    class Sg101Spider(BaseSpider):
        name = "sg101"
        msg_id = 1  # current message to retrieve
        max_msg_id = 21399  # last message to retrieve

        def start_requests(self):
            return [FormRequest(LOGIN_URL,
                                formdata={'login': LOGIN, 'passwd': PASSWORD},
                                callback=self.logged_in)]

        def logged_in(self, response):
            if response.url == 'http://my.yahoo.com':
                self.log("Successfully logged in. Now requesting 1st message.")
                return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                               errback=self.error)
            else:
                self.log("Login failed.")

        def parse_msg(self, response):
            self.log("Got message!")
            print response.body

        def error(self, failure):
            self.log("I haz an error")
When I run the spider, I see that it logs in and issues a request for the first message. However, all I see in Scrapy's debug output is 3 redirects that end up at the URL I requested in the first place. Scrapy never calls my parse_msg() callback, and the crawl stops. Here is a snippet of the log output:
    2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened
    2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login>
    2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com>
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None)
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message.
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
    2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
    2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished)
    2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished)
I can't figure it out. It looks like Yahoo is redirecting the spider (perhaps for authentication?), but the redirects end up back at the URL I wanted to visit in the first place. Yet Scrapy does not call my callback, so I have no way to scrape the data or continue crawling.
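To double-check my reading of the log, here is a minimal sketch in plain Python (no Scrapy involved) that pulls the redirect chain out of the debug lines and checks whether the last hop points back at the URL the chain started from. The names `redirect_hops` and `returns_to_start` are made up for this check and are not Scrapy APIs, and the regex simply assumes the "Redirecting (...) to <...> from <...>" line format shown in the log above.

```python
import re

# Assumed format of a Scrapy redirect debug line, taken from the log above:
#   Redirecting (302) to <GET target-url> from <GET source-url>
REDIRECT_RE = re.compile(
    r"Redirecting \((?P<kind>[^)]+)\) "
    r"to <\w+ (?P<target>[^>\s]+)> "
    r"from <\w+ (?P<source>[^>\s]+)>"
)


def redirect_hops(log_text):
    """Return a (kind, source, target) tuple for every redirect line found."""
    return [
        (m.group("kind"), m.group("source"), m.group("target"))
        for m in REDIRECT_RE.finditer(log_text)
    ]


def returns_to_start(hops):
    """True if the last hop's target is the URL the first hop came from."""
    return bool(hops) and hops[-1][2] == hops[0][1]


# The three redirect lines that follow the request for message 1.
SAMPLE_LOG = """\
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
"""

if __name__ == "__main__":
    hops = redirect_hops(SAMPLE_LOG)
    for kind, source, target in hops:
        print("%s: %s -> %s" % (kind, source, target))
    print("Chain ends at the starting URL:", returns_to_start(hops))
```

For those three lines the check reports that the chain does end at the starting URL, so the final redirect really is a request for the exact URL I asked for in the first place. It does not tell me why the callback is skipped, only that the loop is real.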
Does anyone have any ideas on what is going on and/or how to debug this further? Thanks!