Scrapy with selenium for webpage requiring authentication

I am trying to scrape data from a page that makes a lot of AJAX calls and executes javascript to render the content. I am therefore trying to use scrapy together with selenium. My plan of attack is as follows:

  • Add the login page URL to the spider's start_urls

  • Use FormRequest.from_response in the parse callback to submit the username and password for authentication.

  • After logging in, request the page I want to scrape.

  • Pass this response to Selenium so it can click the buttons on the page.

  • Once the buttons have been clicked and the new page has been rendered, capture the result.

The code I have so far is as follows:

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from selenium import webdriver
    import time

    class LoginSpider(BaseSpider):
        name = "sel_spid"
        start_urls = ["http://www.example.com/login.aspx"]

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            return FormRequest.from_response(
                response,
                formdata={'User': 'username', 'Pass': 'password'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            if "Log Out" in response.body:
                self.log("Successfully logged in")
                scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
                yield Request(url=scrape_url, callback=self.parse_page)
            else:
                self.log("Bad credentials")

        def parse_page(self, response):
            self.driver.get(response.url)
            next = self.driver.find_element_by_class_name('dxWeb_pNext')
            next.click()
            time.sleep(2)
            # capture the html and store in a file

The two roadblocks I have hit so far are:

  • Step 4 does not work. When selenium opens the firefox window, it is still at the login screen and I don't know how to get past it.

  • I do not know how to do step 5.

Any help would be greatly appreciated.

+3
python selenium scrapy
Feb 09 '15 at 21:50
2 answers

I don't believe you can switch back and forth between scrapy requests and selenium like that. The login session you created with scrapy does not carry over to the selenium session, so you need to log in to the site with selenium itself rather than issuing a scrapy Request(). Here is an example (the element ids/xpaths will be different for you):

    scrape_url = "http://www.example.com/authen_handler.aspx"
    self.driver.get(scrape_url)
    time.sleep(2)
    username = self.driver.find_element_by_id("User")
    password = self.driver.find_element_by_name("Pass")
    username.send_keys("your_username")
    password.send_keys("your_password")
    self.driver.find_element_by_xpath("//input[@name='commit']").click()

then you can do:

    time.sleep(2)
    self.driver.find_element_by_class_name('dxWeb_pNext').click()
    time.sleep(2)

and so on.
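For step 5 of the question (capturing the result), a minimal sketch that writes the rendered page to a file after the clicks; the helper name `save_page_html` is mine, not from the answer, and it simply relies on selenium's `page_source` attribute:

```python
def save_page_html(driver, path):
    """Write the browser's currently rendered HTML to a file."""
    html = driver.page_source  # selenium exposes the rendered DOM here
    with open(path, "w") as f:
        f.write(html)
    return html
```

You would call it right after the click/sleep, e.g. `save_page_html(self.driver, "result.html")`.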

EDIT: if you need to render javascript and are worried about speed / blocking, you can use Splash ( http://splash.readthedocs.org/en/latest/index.html ), which should do the trick.

http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie documents how to pass cookies along; you should be able to hand them over from scrapy, but I have not done this before.
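A hedged sketch of what that hand-off could look like: generate a small Splash Lua script that calls `splash:add_cookie` for each cookie taken from the scrapy session before rendering the page. I have not run this against a live Splash instance, and the cookie names are placeholders:

```python
def build_splash_script(cookies, url):
    """Generate a Splash Lua script that sets cookies, then renders `url`."""
    lines = ["function main(splash)"]
    for name, value in cookies.items():
        # splash:add_cookie{name, value} per the Splash scripting reference
        lines.append('  splash:add_cookie{"%s", "%s"}' % (name, value))
    lines.append('  splash:go("%s")' % url)
    lines.append("  return splash:html()")
    lines.append("end")
    return "\n".join(lines)
```

The resulting script would be POSTed to Splash's /execute endpoint along with the target URL.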

+2
Feb 10 '15 at 0:34

Log in using the scrapy API first:

    # submit the login form, with browse_files as the callback
    return FormRequest.from_response(
        response,
        # formxpath=formxpath,
        formdata=formdata,
        callback=self.browse_files
    )

Then transfer the session to the selenium chrome driver:

    # logged in previously with scrapy api
    def browse_files(self, response):
        print "browse files for: %s" % (response.url)

        cookie_list2 = response.headers.getlist('Set-Cookie')
        print cookie_list2

        self.driver.get(response.url)
        self.driver.delete_all_cookies()

        # extract all the cookies
        for cookie2 in cookie_list2:
            cookies = map(lambda e: e.strip(), cookie2.split(";"))
            for cookie in cookies:
                splitted = cookie.split("=")
                if len(splitted) == 2:
                    name = splitted[0]
                    value = splitted[1]
                    # for my particular usecase I needed only these values
                    if name == 'csrftoken' or name == 'sessionid':
                        cookie_map = {"name": name, "value": value}
                    else:
                        continue
                elif len(splitted) == 1:
                    cookie_map = {"name": splitted[0], "value": ''}
                else:
                    continue

                print "adding cookie"
                print cookie_map
                self.driver.add_cookie(cookie_map)

        self.driver.get(response.url)

        # check if we have successfully logged in
        files = self.wait_for_elements_to_be_present(By.XPATH, "//*[@id='files']", response)
        print files
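The header-parsing loop above can also be pulled out into a small, testable helper. This is my refactor sketch, not the answerer's code; it keeps the same behaviour (only the cookies named in `wanted` survive the `name=value` case, and bare flag parts end up with an empty value):

```python
def cookies_for_selenium(set_cookie_headers, wanted=("csrftoken", "sessionid")):
    """Turn raw Set-Cookie header strings into selenium add_cookie() dicts."""
    result = []
    for header in set_cookie_headers:
        for part in (p.strip() for p in header.split(";")):
            pieces = part.split("=")
            if len(pieces) == 2 and pieces[0] in wanted:
                # a name=value pair we actually care about
                result.append({"name": pieces[0], "value": pieces[1]})
            elif len(pieces) == 1:
                # flags like HttpOnly get an empty value, as in the original loop
                result.append({"name": pieces[0], "value": ""})
    return result
```

With this helper, the spider body reduces to `for cookie_map in cookies_for_selenium(response.headers.getlist('Set-Cookie')): self.driver.add_cookie(cookie_map)`.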
0
Jan 05 '17 at 13:42
