I took this Scrapy example from a website and it runs, but something seems to be wrong: it does not collect all of the content, and I don't know what happened. The example uses Scrapy + Redis + MongoDB. As the log below shows, the crawl reaches 292 pages / 291 items and then just sits idle at 0 pages/min.
Log output:
2015-10-09 01:43:33 [scrapy] INFO: Crawled 292 pages (at 292 pages/min), scraped 291 items (at 291 items/min)
2015-10-09 01:44:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:45:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:46:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:47:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:48:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:49:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:50:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:51:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:52:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:53:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:54:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:55:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:56:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:57:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
2015-10-09 01:58:33 [scrapy] INFO: Crawled 292 pages (at 0 pages/min), scraped 291 items (at 0 items/min)
novspider.py
#-*-coding:utf8-*-
from scrapy_redis.spiders import RedisSpider
from scrapy.selector import Selector
from scrapy.http import Request
from novelspider.items import NovelspiderItem
import re


class novSpider(RedisSpider):
    name = "novspider"
    # Note the key spelling here: 'nvospider', not 'novspider'.
    redis_key = 'nvospider:start_urls'
    # A RedisSpider takes its start URLs from redis_key, so this list is ignored.
    start_urls = ['http://www.daomubiji.com/']

    def parse(self, response):
        selector = Selector(response)
        table = selector.xpath('//table')
        for each in table:
            bookName = each.xpath('tr/td[@colspan="3"]/center/h2/text()').extract()[0]
            content = each.xpath('tr/td/a/text()').extract()
            url = each.xpath('tr/td/a/@href').extract()
            for i in range(len(url)):
                item = NovelspiderItem()
                item['bookName'] = bookName
                item['chapterURL'] = url[i]
                try:
                    item['bookTitle'] = content[i].split(' ')[0]
                    item['chapterNum'] = content[i].split(' ')[1]
                except Exception:
                    continue
                try:
                    item['chapterName'] = content[i].split(' ')[2]
                except Exception:
                    # Fall back to the last three characters of the chapter number field.
                    item['chapterName'] = content[i].split(' ')[1][-3:]
                # The callback must be a callable, not the string 'parseContent'.
                yield Request(url[i], callback=self.parseContent, meta={'item': item})

    def parseContent(self, response):
        selector = Selector(response)
        item = response.meta['item']
        html = selector.xpath('//div[@class="content"]').extract()[0]
        textField = re.search('<div style="clear:both"></div>(.*?)<div', html, re.S).group(1)
        text = re.findall('<p>(.*?)</p>', textField, re.S)
        fulltext = ''
        for each in text:
            fulltext += each
        item['text'] = fulltext
        yield item
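For reference, a RedisSpider does not schedule its own start_urls: scrapy_redis pops start URLs from the Redis list named by redis_key, and the spider idles while that list is empty. A minimal sketch of seeding the queue with the redis Python client (assuming a local Redis on default settings, and using the key exactly as spelled in the spider above):

import redis

# Connect to the local Redis instance (assumed defaults: localhost:6379, db 0).
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# scrapy_redis pops start URLs from the list named by the spider's redis_key,
# so the key pushed here must match it exactly ('nvospider:start_urls' as posted).
r.lpush('nvospider:start_urls', 'http://www.daomubiji.com/')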
settings.py
# -*- coding: utf-8 -*-
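The posted settings.py is cut off after the coding line. For context, a scrapy_redis project normally enables the Redis scheduler and duplicate filter here; a hedged sketch of the usual entries (the scheduler and dupefilter paths are from scrapy_redis, while the pipeline class name is an assumption matching the sketch under pipelines.py below):

BOT_NAME = 'novelspider'
SPIDER_MODULES = ['novelspider.spiders']
NEWSPIDER_MODULE = 'novelspider.spiders'

# scrapy_redis: schedule requests through Redis and deduplicate there too.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the request queue and dupefilter between runs

# Redis connection (assumed local defaults).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Route scraped items into the (assumed) MongoDB pipeline.
ITEM_PIPELINES = {
    'novelspider.pipelines.NovelspiderPipeline': 300,
}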
pipelines.py
# -*- coding: utf-8 -*-
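pipelines.py is likewise truncated. Since the question says the example stores results in MongoDB, the pipeline presumably inserts each item via pymongo; the following is a hypothetical sketch only (class, database, and collection names are made up, assuming pymongo 3), not the original file:

import pymongo


class NovelspiderPipeline(object):
    """Hypothetical MongoDB pipeline: stores every scraped item as a document."""

    def open_spider(self, spider):
        # Assumed local MongoDB defaults; database/collection names are invented.
        self.client = pymongo.MongoClient('localhost', 27017)
        self.collection = self.client['novelspider']['chapters']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item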
items.py
# -*- coding: utf-8 -*-
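items.py was also cut off, but its fields follow directly from what the spider assigns, so it can be reconstructed under that assumption:

from scrapy import Item, Field


class NovelspiderItem(Item):
    # Fields exactly as assigned in novspider.py.
    bookName = Field()
    bookTitle = Field()
    chapterNum = Field()
    chapterName = Field()
    chapterURL = Field()
    text = Field()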