Scraping Facebook data with Scrapy

The new Graph Search on Facebook lets you search for the current employees of a company using a query such as "Current Google employees".

I want to scrape the results page ( http://www.facebook.com/search/104958162837/employees/present ) using Scrapy.

The initial problem was that Facebook only lets a logged-in user access this information, so the request was redirected to login.php. Therefore, before scraping this URL, I log in via Scrapy and then request the results page. But even though the HTTP response for this page is 200, no data is scraped. The code is as follows:

    import sys

    from scrapy import log
    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.item import Item

    # Assumed definition: the results page URL given above
    query = 'http://www.facebook.com/search/104958162837/employees/present'


    class DmozSpider(BaseSpider):
        name = "test"
        start_urls = ['https://www.facebook.com/login.php']
        task_urls = [query]

        def parse(self, response):
            # Fill in and submit the login form found on login.php
            return [FormRequest.from_response(
                response,
                formname='login_form',
                formdata={'email': 'myemailid', 'pass': 'myfbpassword'},
                callback=self.after_login)]

        def after_login(self, response):
            if "authentication failed" in response.body:
                self.log("Login failed", level=log.ERROR)
                return
            # Logged in: now request the results page
            return Request(query, callback=self.page_parse)

        def page_parse(self, response):
            hxs = HtmlXPathSelector(response)
            print hxs
            items = hxs.select('//div[@class="_4_yl"]')
            count = 0
            print items

What am I missing or doing wrong?

Thanks in advance.

1 answer

The problem is that the search results (in particular the div initial_browse_result) are loaded dynamically through JavaScript. Scrapy receives the page before that JavaScript has run, so the results are not in the response yet.
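To see why the selector comes back empty, it can help to compare the HTML the server sends with the HTML a browser ends up with. The sketch below is only an illustration: both snippets of markup are made up, and treating initial_browse_result as an id is an assumption. It uses the standard library parser so it runs without Scrapy installed.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical markup: what the server sends before any script has run.
served_html = """
<div id="initial_browse_result"></div>
<script>/* fills initial_browse_result later */</script>
"""

# Hypothetical markup: the same container after a browser ran the script.
rendered_html = """
<div id="initial_browse_result">
  <div class="_4_yl">Result one</div>
  <div class="_4_yl">Result two</div>
</div>
"""

print(count_class(served_html, "_4_yl"))    # 0
print(count_class(rendered_html, "_4_yl"))  # 2
```

A spider that only sees the first form of the page gets a 200 response and an empty selection, which matches the behaviour described in the question.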

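Basically, you have two options:

1. Reproduce in Scrapy the request that the page's JavaScript makes to fetch the results.
2. Load the page with a tool that executes JavaScript, then parse the rendered HTML with Scrapy.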

If you go with the first option, you should analyze all the requests made while the page loads and find the one that returns the data you want to scrape.
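As a rough sketch of the first option: once the responsible request has been spotted in the browser's network tab, it can be rebuilt by hand. Everything specific here is an assumption made for illustration: the endpoint URL, the cookie value, the X-Requested-With header and the JSON shape are placeholders, not Facebook's real interface. The request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the URL seen in the browser's network tab.
RESULTS_ENDPOINT = "https://www.example.com/ajax/search/results"


def build_results_request(cookie_header):
    """Build (but do not send) a request that mimics the page's background call."""
    return urllib.request.Request(
        RESULTS_ENDPOINT,
        headers={
            "Cookie": cookie_header,                  # session obtained at login
            "X-Requested-With": "XMLHttpRequest",     # assumed marker header
        },
    )


def extract_names(payload_text):
    """Read names from an assumed payload shaped like {"results": [{"name": ...}]}."""
    payload = json.loads(payload_text)
    return [entry["name"] for entry in payload.get("results", [])]


request = build_results_request("session=placeholder")
sample_payload = '{"results": [{"name": "Alice"}, {"name": "Bob"}]}'

print(request.full_url)
print(extract_names(sample_payload))
```

Inside the spider, the same URL and headers would go into the Request returned from after_login, with a callback that parses whatever format the real response uses.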

The second option is fairly simple and will definitely work: use another tool to load the page together with its JavaScript-generated data, then parse the result with Scrapy.
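The second option can be sketched as two separate steps: a loader that returns the HTML after scripts have run, and a parser that works on that HTML. In the sketch below, `render` is a hypothetical callable standing in for whatever JavaScript-capable tool is used, `fake_render` returns made-up markup so the example runs offline, and the standard library parser stands in for a Scrapy selector.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class TextByClass(HTMLParser):
    """Collect the text content of elements that carry a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []    # text pieces of the element being read
        self.texts = []     # finished results

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.texts.append("".join(self.chunks).strip())
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def scrape(url, render):
    """Load `url` through `render`, then extract the result entries."""
    parser = TextByClass("_4_yl")
    parser.feed(render(url))
    return parser.texts


def fake_render(url):
    # Stand-in for a real loader that executes JavaScript (hypothetical markup).
    return ('<div class="_4_yl"><a href="#">Result one</a></div>'
            '<div class="_4_yl">Result two</div>')


print(scrape("http://www.facebook.com/search/104958162837/employees/present",
             fake_render))
# ['Result one', 'Result two']
```

Keeping the loader behind a plain callable means the parsing step can be tested on saved HTML and the loading tool can be swapped without touching the extraction code.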

Hope this helps.

Source: https://habr.com/ru/post/1483848/

