The new Graph Search on Facebook lets you search for the current employees of a company using a query such as "Current Google employees".
I want to scrape the results page ( http://www.facebook.com/search/104958162837/employees/present ) using Scrapy.
The initial problem was that Facebook only lets a logged-in user access this information, so I was redirected to login.php. Therefore, before scraping this URL, I log in via Scrapy and then request the results page. But even though the HTTP response for this page is 200, no data is scraped. The code is as follows:
    from scrapy import log
    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from scrapy.selector import HtmlXPathSelector

    # Graph Search results page to scrape after logging in (the URL given above)
    query = 'http://www.facebook.com/search/104958162837/employees/present'


    class DmozSpider(BaseSpider):
        name = "test"
        start_urls = ['https://www.facebook.com/login.php']
        task_urls = [query]

        def parse(self, response):
            # Fill in and submit the login form
            return [FormRequest.from_response(
                response,
                formname='login_form',
                formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
                callback=self.after_login)]

        def after_login(self, response):
            if "authentication failed" in response.body:
                self.log("Login failed", level=log.ERROR)
                return
            # Logged in: now request the results page
            return Request(query, callback=self.page_parse)

        def page_parse(self, response):
            hxs = HtmlXPathSelector(response)
            print hxs
            items = hxs.select('//div[@class="_4_yl"]')
            count = 0
            print items
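One check that might help narrow this down is whether the class name used in the XPath occurs in the raw response at all, and if so whether it sits in real markup or only inside HTML comments, where an XPath on `div` elements would not find it. The sketch below is only a diagnostic under that assumption; `locate_marker` is a hypothetical helper, not part of Scrapy, and it does not claim to describe how Facebook actually delivers the page.

```python
import re

# Matches HTML comments, including ones that span several lines.
COMMENT_RE = re.compile(r'<!--(.*?)-->', re.DOTALL)


def locate_marker(html, marker):
    """Report where `marker` occurs in an HTML document (hypothetical helper).

    Returns 'absent' if the marker is nowhere in the raw text,
    'comment_only' if every occurrence is inside HTML comments,
    and 'markup' if at least one occurrence is outside comments.
    """
    if marker not in html:
        return 'absent'
    # Remove all comments and check whether the marker survives.
    without_comments = COMMENT_RE.sub('', html)
    if marker in without_comments:
        return 'markup'
    return 'comment_only'
```

Inside `page_parse`, calling `locate_marker(response.body, '_4_yl')` would show which case applies: 'absent' suggests the results are not in the downloaded HTML (for example, loaded later by JavaScript), while 'comment_only' suggests the markup is present but hidden from the selector.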
What could I be missing or doing wrong?
Thanks in advance.