Problem using Scrapy to scrape a Yahoo group

I am new to web scraping and just started experimenting with Scrapy, a screen scraping framework written in Python. My goal is to scrape an old Yahoo group, since they do not provide an API or any other means of extracting message archives. The Yahoo group is set up so that you have to be logged in before you can browse the archives.

I need to follow these steps:

  • Log in to Yahoo
  • Visit the URL of the first message and scrape it
  • Repeat step 2 for the next message, and so on
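Steps 2 and 3 amount to iterating over sequential message IDs and building a URL for each one. As a framework-free sketch (the helper name and the `%d` URL template are my own illustration, not from the post):

    def message_urls(base_url, first_id, last_id):
        """Yield the archive URL for every message ID in order.

        base_url is assumed to contain a %d placeholder, e.g.
        'http://launch.groups.yahoo.com/group/MyYahooGroup/message/%d'.
        """
        for msg_id in range(first_id, last_id + 1):
            yield base_url % msg_id

    # Example with a placeholder domain:
    urls = list(message_urls('http://example.invalid/message/%d', 1, 3))

The spider below does essentially this, but one request at a time, advancing `msg_id` after each message is fetched.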

I started to write a spider to accomplish the above, and here is what I have so far. All I want to verify is that the login works and that I can retrieve the first message. I will finish the rest once I get this working:

    class Sg101Spider(BaseSpider):
        name = "sg101"
        msg_id = 1          # current message to retrieve
        max_msg_id = 21399  # last message to retrieve

        def start_requests(self):
            return [FormRequest(LOGIN_URL,
                                formdata={'login': LOGIN, 'passwd': PASSWORD},
                                callback=self.logged_in)]

        def logged_in(self, response):
            if response.url == 'http://my.yahoo.com':
                self.log("Successfully logged in. Now requesting 1st message.")
                return Request(MSG_URL % self.msg_id,
                               callback=self.parse_msg,
                               errback=self.error)
            else:
                self.log("Login failed.")

        def parse_msg(self, response):
            self.log("Got message!")
            print response.body

        def error(self, failure):
            self.log("I haz an error")

When I launch the spider, I see that it logs in and issues a request for the first message. However, all I see in Scrapy's debug output is 3 redirects that end up at the URL I originally requested. Scrapy never calls my parse_msg() callback, and the crawl stops. Here is a snippet of the log output:

    2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened
    2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login>
    2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com>
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None)
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message.
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
    2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
    2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
    2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished)
    2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished)

I can't figure it out. It seems like Yahoo is redirecting the spider (perhaps for authentication?), but it ends up back at the URL I wanted to visit in the first place. Yet Scrapy does not call my callback, so I have no way to extract the data or continue the crawl.

Does anyone have any ideas on what is going on and/or how to debug this further? Thanks!

1 answer

I think Yahoo redirects through an authorization check and finally redirects back to the page I really wanted. However, Scrapy has already seen that request and stops processing it, because it does not want to get caught in a loop. The solution, in my case, is to add dont_filter=True to the Request constructor. This instructs Scrapy not to filter out duplicate requests. That is fine in my case, because I know in advance which URLs I want to crawl.
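The behavior can be illustrated with a toy, framework-free sketch. This is not Scrapy's real duplicate filter, just the idea behind it: each request URL is reduced to a fingerprint, a request whose fingerprint has already been seen is dropped, and a dont_filter flag bypasses the check entirely:

    import hashlib

    class ToyDupeFilter(object):
        """Toy illustration of fingerprint-based request deduplication.

        Not Scrapy's implementation -- just the concept: remember a hash
        of every URL requested, and silently drop repeats unless the
        caller opts out with dont_filter=True.
        """
        def __init__(self):
            self.seen = set()

        def should_drop(self, url, dont_filter=False):
            if dont_filter:
                return False  # caller vouches that the repeat is intentional
            fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
            if fp in self.seen:
                return True   # duplicate: already requested this URL
            self.seen.add(fp)
            return False

    f = ToyDupeFilter()
    url = 'http://launch.groups.yahoo.com/group/MyYahooGroup/message/1'
    first = f.should_drop(url)                     # False: first visit is allowed
    second = f.should_drop(url)                    # True: the auth redirect lands here again
    third = f.should_drop(url, dont_filter=True)   # False: filtering bypassed

This is why the log above shows the final 302 back to .../message/1 followed immediately by "Closing spider (finished)": the re-requested URL is treated as a duplicate and dropped, so no callback ever fires.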

    def logged_in(self, response):
        if response.url == 'http://my.yahoo.com':
            self.log("Successfully logged in. Now requesting message page.",
                     level=log.INFO)
            return Request(MSG_URL % self.msg_id,
                           callback=self.parse_msg,
                           errback=self.error,
                           dont_filter=True)
        else:
            self.log("Login failed.", level=log.CRITICAL)

Source: https://habr.com/ru/post/1338358/
