I am doing research related to Internet-wide crawling and indexing.
While there are several projects in this space (IRLbot, Distributed-indexing, Cluster-Scrapy, Common-Crawl, etc.), mine is more focused on simulating such behavior. I am looking for an easy way to crawl real web pages without knowing anything in advance about their URL or HTML structure and:
- extract all the text (in order to index it)
- collect all of its URLs and add them to the crawl queue
- fail gracefully (even if no text is scraped) when it hits a malformed web page
To clarify, this is just a proof of concept (PoC), so I do not mind that it will not scale, runs slowly, etc. My aim is to extract most of the text that is presented to the user, in most cases, with or without dynamic content, and with as little “garbage” (functions, tags, keywords, etc.) as possible. A simple, partially working solution that works out of the box is preferable to an ideal solution that requires a lot of deployment expertise.
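Roughly, I imagine something like the sketch below: a spider that is not tied to any particular page structure, dumps the visible text, follows every link it finds, and skips non-HTML responses instead of crashing. The spider name, XPath filters, and output fields are just placeholders I made up, not something I am committed to.

import scrapy
from scrapy.http import TextResponse


class GenericTextSpider(scrapy.Spider):
    # Placeholder name; structure-agnostic crawl starting from a single seed URL.
    name = 'generic_text'
    start_urls = ["http://www.stackoverflow.com"]

    def parse(self, response):
        # Skip binary responses (images, PDFs, ...) instead of raising on them.
        if not isinstance(response, TextResponse):
            return

        # All text nodes under <body>, excluding <script> and <style> contents.
        text_nodes = response.xpath(
            '//body//text()[not(ancestor::script) and not(ancestor::style)]'
        ).extract()
        page_text = ' '.join(t.strip() for t in text_nodes if t.strip())

        yield {'url': response.url, 'text': page_text}

        # Follow every outgoing link; response.follow resolves relative URLs.
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)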
The secondary problem is storing (url, extracted text) pairs for indexing (by another process?), but I think I can figure that part out myself with a bit more digging.
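For that part I assume the simplest options are Scrapy's built-in feed export (e.g. scrapy crawl itsy -o pages.jl for a JSON-lines file) or a tiny item pipeline that appends one JSON line per page, along the lines of the following (the file name and field names are just my guesses, matching the dict yielded in the sketch above):

import json


class JsonLinesStoragePipeline:
    # Appends each (url, text) pair to a JSON-lines file for a separate indexer process.

    def open_spider(self, spider):
        self.out = open('crawled_pages.jl', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.out.close()

    def process_item(self, item, spider):
        # 'item' is the plain dict yielded by the spider sketch above.
        self.out.write(json.dumps({'url': item['url'], 'text': item['text']}) + '\n')
        return item

If that route makes sense, I would register it under ITEM_PIPELINES in settings.py.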
Any advice on how to make the parse function more “robust” would be greatly appreciated! Here is what I have so far:
import scrapy
from scrapy_1.tutorial.items import WebsiteItem


class FirstSpider(scrapy.Spider):
    name = 'itsy'
    # allowed_domains = ['dmoz.org']
    start_urls = [
        "http://www.stackoverflow.com"
    ]

    # def parse(self, response):
    #     filename = response.url.split("/")[-2] + '.html'
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = WebsiteItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['body_text'] = sel.xpath('text()').extract()
            yield item