I am doing research related to Internet-wide crawling and indexing.
While there are several projects in this space (IRLbot, Distributed-indexing, Cluster-Scrapy, Common-Crawl, etc.), mine is more focused on simulating such behavior. I am looking for an easy way to crawl real web pages without knowing anything in advance about their URL or HTML structure and:
- extract all the text (in order to index it)
- collect all of its URLs and add them to the crawl queue
- fail gracefully (even if no text is scraped) when it hits a malformed web page
To clarify, this is just a proof of concept (PoC), so I do not mind that it will not scale, runs slowly, etc. My aim is to extract most of the text that is presented to the user, in most cases, with or without dynamic content, and with as little “garbage” (functions, tags, keywords, etc.) as possible. A simple, partially working solution that works out of the box is preferable to an ideal solution that requires a lot of deployment expertise.
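Roughly, I imagine something like the sketch below: a spider that is not tied to any particular page structure, dumps the visible text, follows every link it finds, and skips non-HTML responses instead of crashing. The spider name, XPath filters, and output fields are just placeholders I made up, not something I am committed to.

import scrapy
from scrapy.http import TextResponse


class GenericTextSpider(scrapy.Spider):
    # Placeholder name; structure-agnostic crawl starting from a single seed URL.
    name = 'generic_text'
    start_urls = ["http://www.stackoverflow.com"]

    def parse(self, response):
        # Skip binary responses (images, PDFs, ...) instead of raising on them.
        if not isinstance(response, TextResponse):
            return

        # All text nodes under <body>, excluding <script> and <style> contents.
        text_nodes = response.xpath(
            '//body//text()[not(ancestor::script) and not(ancestor::style)]'
        ).extract()
        page_text = ' '.join(t.strip() for t in text_nodes if t.strip())

        yield {'url': response.url, 'text': page_text}

        # Follow every outgoing link; response.follow resolves relative URLs.
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)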
The secondary problem is storing (url, extracted text) pairs for indexing (by another process?), but I think I can figure that part out myself with a bit more digging.
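For that part I assume the simplest options are Scrapy's built-in feed export (e.g. scrapy crawl itsy -o pages.jl for a JSON-lines file) or a tiny item pipeline that appends one JSON line per page, along the lines of the following (the file name and field names are just my guesses, matching the dict yielded in the sketch above):

import json


class JsonLinesStoragePipeline:
    # Appends each (url, text) pair to a JSON-lines file for a separate indexer process.

    def open_spider(self, spider):
        self.out = open('crawled_pages.jl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.out.close()

    def process_item(self, item, spider):
        # 'item' is the plain dict yielded by the spider sketch above.
        self.out.write(json.dumps({'url': item['url'], 'text': item['text']}) + '\n')
        return item

If that route makes sense, I would register it under ITEM_PIPELINES in settings.py.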
Any advice on how to make the parse function more “robust” would be greatly appreciated! Here is what I have so far:
import scrapy
from scrapy_1.tutorial.items import WebsiteItem


class FirstSpider(scrapy.Spider):
    name = 'itsy'
    # allowed_domains = ['dmoz.org']
    start_urls = [
        "http://www.stackoverflow.com"
    ]

    # def parse(self, response):
    #     filename = response.url.split("/")[-2] + '.html'
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = WebsiteItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body_text'] = sel.xpath('text()').extract()
            yield item