Scraping Facebook data with scrapy

Question


The new graph search on facebook allows you to search for current employees of a company using a query such as "Current Google employees".

I want to scrape the results page ( http://www.facebook.com/search/104958162837/employees/present ) using scrapy.

The initial problem was that facebook only lets a logged-in facebook user access this information, so I was redirected to login.php. Therefore, before scraping this URL, I log in via scrapy and then request the results page. But even though the HTTP response for this page is 200, no data gets scraped. The code is as follows:

    import sys

    from scrapy import log
    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.item import Item

    # Assumed to be the results page URL given above
    query = 'http://www.facebook.com/search/104958162837/employees/present'

    class DmozSpider(BaseSpider):
        name = "test"
        start_urls = ['https://www.facebook.com/login.php']
        task_urls = [query]

        def parse(self, response):
            # Submit the login form, then continue in after_login
            return [FormRequest.from_response(response, formname='login_form', formdata={'email': 'myemailid', 'pass': 'myfbpassword'}, callback=self.after_login)]

        def after_login(self, response):
            if "authentication failed" in response.body:
                self.log("Login failed", level=log.ERROR)
                return
            # Logged in: now request the search results page
            return Request(query, callback=self.page_parse)

        def page_parse(self, response):
            hxs = HtmlXPathSelector(response)
            print hxs
            items = hxs.select('//div[@class="_4_yl"]')
            count = 0
            print items

What could I be missing or doing wrong?

Thanks in advance.

+4

facebook web web-crawler scrapy

Aryabhatt May 31 '13 at 18:16


1 answer

alecxe · Answer 1 · 2013-05-31T20:13:13+0000

The problem is that the search results (in particular the div initial_browse_result ) are loaded dynamically through javascript. Scrapy fetches the page before that javascript has run, so the results are not in the response yet.
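One way to confirm this is to check whether the result markup is present at all in the raw HTML that scrapy receives. Below is a minimal sketch using only the standard library. The helper name `count_elements_with_class` and both sample strings are made up for illustration; `_4_yl` is the class from the XPath in the question.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html, class_name):
    """Return how many elements in the HTML string carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical stand-ins: what the server sends vs. what the browser
# shows after the javascript has filled in the results.
raw_html = '<div id="initial_browse_result"></div><script>loadResults()</script>'
rendered_html = (
    '<div id="initial_browse_result">'
    '<div class="_4_yl">A</div><div class="_4_yl">B</div>'
    '</div>'
)

print(count_elements_with_class(raw_html, "_4_yl"))       # 0
print(count_elements_with_class(rendered_html, "_4_yl"))  # 2
```

If the same check on the response body inside page_parse also gives 0, the selector is not at fault; the data simply has not been loaded into that response.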

Basically, you have two options:

1. try to simulate these js (XHR) requests in scrapy, see
- Scraping ajax pages using python
- Can scrapy be used to scrape dynamic content from websites that are using AJAX?
2. use a combination of scrapy and selenium , or scrapy and mechanize to load the whole page with its content, see
- Executing Javascript Submit form functions using scrapy in python
- this answer

If you go with the first option, you should analyze all the requests made while the page is loading (for example in the browser's network tab) and find out which one is responsible for fetching the data that you want to scrape.
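Once you have found that request, the first option comes down to rebuilding it and decoding its response. Here is a rough sketch with the standard library only. Everything specific in it is an assumption: the endpoint `XHR_ENDPOINT`, the `page` parameter, the cookie value, the `payload` key and the `for (;;);` guard prefix are placeholders for whatever the real request and response turn out to contain.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: replace with the URL seen in the network tab.
XHR_ENDPOINT = "https://www.example.com/ajax/search_results"


def build_xhr_request(params, cookie_header):
    """Build (but do not send) a request that mimics the browser's XHR call."""
    url = XHR_ENDPOINT + "?" + urlencode(params)
    headers = {
        "X-Requested-With": "XMLHttpRequest",  # commonly sent by AJAX calls
        "Cookie": cookie_header,               # session cookies from the login step
    }
    return Request(url, headers=headers)


def parse_xhr_body(body, guard_prefix="for (;;);"):
    """Decode a JSON body, dropping a guard prefix if the site adds one."""
    if body.startswith(guard_prefix):
        body = body[len(guard_prefix):]
    return json.loads(body)


request = build_xhr_request({"page": 2}, "session=abc")
print(request.full_url)  # https://www.example.com/ajax/search_results?page=2

# Made-up response body; the real structure has to be inspected by hand.
sample_body = 'for (;;);{"payload": "<div class=\\"_4_yl\\">Alice</div>"}'
print(parse_xhr_body(sample_body)["payload"])  # <div class="_4_yl">Alice</div>
```

In a spider, the same URL and headers would go into a scrapy Request yielded from after_login, and the decoding would happen in its callback. Scrapy keeps the login cookies for you, so the explicit Cookie header is only needed outside scrapy.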

The second option is pretty straightforward and will definitely work: you just use another tool to load the page together with its js-loaded data, and then parse the result into scrapy items.
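The second option could look roughly like this. It is a sketch, not a tested Facebook scraper: `load_rendered_html` and `scrape_with_firefox` are names made up here, the marker string `_4_yl` is taken from the question, and the login step is left out. The waiting helper only needs an object with `get()` and `page_source`, which a selenium driver provides.

```python
import time


def load_rendered_html(driver, url, marker, timeout=10.0, poll=0.5):
    """Open url in a browser driver and wait until marker shows up in the page source."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html  # the javascript has filled in the results
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "marker %r not found within %s seconds" % (marker, timeout)
            )
        time.sleep(poll)


def scrape_with_firefox(url):
    """Requires the selenium package and a Firefox driver; defined but not called here."""
    # Imported lazily so the helper above has no third-party dependency.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return load_rendered_html(driver, url, marker="_4_yl")
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The returned HTML can then be handed to a scrapy selector (in recent scrapy versions, `Selector(text=html)`) and queried with the same XPath as in the question.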

Hope this helps.
