Scrapy: changing elements and fields in a response

I'm relatively new to Scrapy, Python, and object-oriented programming, so I apologize if I get any terminology wrong or am unclear in any way.

I am trying to write a spider that, as it scrapes elements from the response, also creates a modified version of the response to save to a file. For example, I want to change the 'src' links so that they point to copies of the files saved locally.

I am currently scraping data using Scrapy selectors and modifying the response using lxml. However, I would like to use Scrapy methods for the modification instead of lxml, since using both Scrapy and lxml selectors essentially means duplicating the code that finds the same elements in the response.

I have included the code below to illustrate my point. Everything happens in the spider's parse function.

import re
from os.path import basename

import lxml.html

def parse(self, response):

    # Scrape thumbnail URLs using Scrapy selectors
    for post in response.css('.post'): # For each post
        for thumb in post.css('.thumb'): # For each thumbnail
            item = Item() # Create an image item
            item['thumbnail_url'] = []
            item['thumbnail_savepath'] = []
            for x in thumb.xpath('img/@src').extract():
                thumbnail_url = 'https:' + x # src values are protocol-relative
                thumbnail_filename = re.search('.*/(.*)', thumbnail_url).group(1)
                thumbnail_savepath = 'thumbnails/' + thumbnail_filename
                item['thumbnail_url'] += [thumbnail_url]
                item['thumbnail_savepath'] += [thumbnail_savepath]

    # Make modified html using lxml
    body_lxml = lxml.html.document_fromstring(response.body)
    for thumbnail in body_lxml.xpath('//img'):
        thumbnail_src = thumbnail.get('src') # Original link address
        thumbnail_path = './thumbnails/' + basename(thumbnail_src) # New link address
        thumbnail.set('src', thumbnail_path) # Setting new link address

As the code shows, it loops through the images to scrape the elements using Scrapy selectors, and then loops through them a second time using lxml to modify the response. Having to use two different methods to loop through the same elements is what I am trying to avoid. I would like to do the scraping and the modification in the same loop, if possible.
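Part of the duplication is that each loop derives the local path from the src value in its own way. A minimal sketch of one shared helper follows, using only the standard library. The function name thumbnail_paths, its page_url and folder defaults, and the example.com URLs are all hypothetical and not part of the code above.

```python
from posixpath import basename
from urllib.parse import urljoin, urlsplit


def thumbnail_paths(src, page_url="https://example.com/", folder="thumbnails"):
    """Return (absolute_url, local_path) for one img src value.

    page_url and folder are illustrative defaults, not taken from the spider.
    """
    # urljoin resolves protocol-relative ('//host/a.jpg') and relative links
    absolute_url = urljoin(page_url, src)
    # Take the file name from the URL path only, ignoring any query string
    filename = basename(urlsplit(absolute_url).path)
    return absolute_url, f"{folder}/{filename}"


url, path = thumbnail_paths("//cdn.example.com/img/cat.jpg?size=small")
print(url)
print(path)
```

Inside a spider, response.urljoin(src) could stand in for the urljoin call, since it resolves the link against the response URL.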

I thought the response.request() method could be used for this, but I am struggling to work out how to use it from the documentation and from searching online. Is there any way in Scrapy to modify individual elements or fields in a response? Any help would be appreciated.

Thanks.



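Scrapy uses lxml under the hood.

Parsel (the library Scrapy uses for its selectors) is designed for extracting data from HTML, not for editing it.

So if you need to modify the document, lxml is the tool to use.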


Source: https://habr.com/ru/post/1598596/

