I would approach it a little differently:
- open the page in a browser and click "Show more" until you reach the page you want
- initialize a Scrapy TextResponse with the current page source (with all the necessary posts loaded)
- for each post, initialize an Item, yield a Request to the post page, and pass the Item instance from the request to the response in the meta dictionary
Notes and changes I am introducing:
The code:
    import scrapy
    from scrapy import signals
    from scrapy.http import TextResponse
    # scrapy.xlib.pydispatch is not available in newer Scrapy releases;
    # there, connect the handler with crawler.signals.connect in from_crawler.
    from scrapy.xlib.pydispatch import dispatcher

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC


    class ItalkiItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        text = scrapy.Field()


    class ItalkiSpider(scrapy.Spider):
        name = "italki"
        allowed_domains = ['italki.com']
        start_urls = ['http://www.italki.com/entries/korean']

        def __init__(self):
            # One browser instance for the whole crawl
            self.driver = webdriver.Firefox()
            # Close the browser when the spider finishes
            dispatcher.connect(self.spider_closed, signals.spider_closed)

        def spider_closed(self, spider):
            self.driver.close()

        def parse(self, response):
Use this as your base code and extend it to fill in the other fields, for example author or author_url. Hope this helps.