How to remove expired items from a database using Scrapy

I am spidering a video site that expires content frequently. I am considering using Scrapy to do the spidering, but I am not sure how to delete expired items.

Strategies for detecting item expiration:

  • Spider the site's "delete.rss".
  • Every few days, try reloading the content page and make sure it still works.
  • Spider every page of the site's content indexes and delete the video if it is not found.

Please let me know how to remove expired items in Scrapy. I will be storing my Scrapy items in a MySQL database via Django.

2010-01-18 Update

I found a solution that works, but it may still not be optimal. I maintain a "found_in_last_scan" flag on every video that I sync. When the spider starts, it sets all flags to False. When it finishes, it deletes the videos whose flag is still False. I did this by attaching handlers to signals.spider_opened and signals.spider_closed. Please confirm that this is a valid strategy and that there are no problems with it.
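A minimal sketch of the mark-and-sweep flow described above, with sqlite3 standing in for the Django/MySQL layer so that it runs on its own. The `videos` table, the `ExpirySweeper` class and the `seen` method are hypothetical names for illustration. In a real project the two handlers would be connected to signals.spider_opened and signals.spider_closed, where Scrapy also passes the spider as an argument. Skipping the sweep unless the close reason is "finished" is an added precaution, not part of the original description.

```python
import sqlite3


class ExpirySweeper:
    """Mark-and-sweep over a hypothetical `videos` table."""

    def __init__(self, conn):
        self.conn = conn

    def spider_opened(self):
        # Mark: assume every video is gone until the crawl sees it again.
        self.conn.execute("UPDATE videos SET found_in_last_scan = 0")

    def seen(self, url):
        # Called for every video the spider finds during the crawl.
        self.conn.execute(
            "UPDATE videos SET found_in_last_scan = 1 WHERE url = ?", (url,)
        )

    def spider_closed(self, reason):
        # Sweep only after a complete crawl; an aborted run would
        # otherwise delete videos the spider never reached.
        if reason == "finished":
            self.conn.execute("DELETE FROM videos WHERE found_in_last_scan = 0")
        self.conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE videos (url TEXT PRIMARY KEY, found_in_last_scan INTEGER)"
)
conn.executemany("INSERT INTO videos VALUES (?, 1)", [("/v/1",), ("/v/2",)])

sweeper = ExpirySweeper(conn)
sweeper.spider_opened()            # all flags reset to 0
sweeper.seen("/v/1")               # only /v/1 shows up in this crawl
sweeper.spider_closed("finished")  # /v/2 is swept
remaining = [row[0] for row in conn.execute("SELECT url FROM videos")]
print(remaining)  # ['/v/1']
```

The main risk with this strategy is a crawl that stops early (network failure, ban, exception): every video not yet visited still carries a False flag and would be deleted, which is why the sketch checks the close reason first.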

+3
2 answers

I have not tested this!
I must admit that I have not tried using Django models in Scrapy, but here goes:

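The simplest way, I think, would be to create a separate spider for deleted.rss by extending XMLFeedSpider (the code is adapted from the Scrapy documentation). Something like this: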

from scrapy.spiders import XMLFeedSpider
from myproject.items import DeletedUrlItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com/deleted.rss']
    iterator = 'iternodes'  # This is actually unnecessary, since it is the default value
    itertag = 'item'

    def parse_node(self, response, node):
        url = DeletedUrlItem()
        # Placeholder XPath: replace it with the real path to the URL element
        url['url'] = node.xpath('path/to/url/text()').get()

        return url  # return an Item

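This is not a working spider that you can use as-is: IIRC, RSS files are plain XML. I don't know what deleted.rss looks like, but you should be able to work out how to extract the URLs from the XML. The example imports myproject.items.DeletedUrlItem, which you need to create yourself, with something like this: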

from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()

Instead of saving, you delete the records using Django's Model API in a Scrapy ItemPipeline (I assume you are using DjangoItem):

# we raise a DropItem exception so Scrapy
# doesn't try to process the item any further
from scrapy.exceptions import DropItem

# import your model (hypothetical path; adjust it to your Django app)
from myapp.models import YourModel

class DeleteUrlPipeline:

    def process_item(self, item, spider):
        if item['url']:
            try:
                delete_item = YourModel.objects.get(url=item['url'])
            except YourModel.DoesNotExist:
                # nothing stored under this URL, so nothing to delete
                raise DropItem("Not found: %s" % item)
            delete_item.delete()  # actually delete the item!
            raise DropItem("Deleted: %s" % item)
        return item

Note the call to delete_item.delete().
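A pipeline only runs once it is enabled in the project settings. A minimal sketch of that entry, assuming the class above lives in a hypothetical myproject/pipelines.py module (the module path and the order value 300 are illustrative, not taken from the answer):

```python
# settings.py (fragment)
# Keys are dotted paths to pipeline classes; values set the run order
# (lower numbers run first). The path below is a hypothetical location.
ITEM_PIPELINES = {
    "myproject.pipelines.DeleteUrlPipeline": 300,
}
```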


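This answer may well contain errors, since it was written from memory :-), so leave a comment if something does not work.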

+4

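If you have an HTTP URL that you suspect may no longer be valid (because you found it in a "deleted" feed, or simply because you haven't checked it in a while), the simplest and fastest way to check is to send an HTTP HEAD request for that URL. In Python this is best done with the standard library's httplib module (http.client in Python 3): create a connection object c to the host of interest with HTTPConnection (under HTTP 1.1 it may be reusable for checking several URLs, with better performance and lower system load), then make one call (or more, where feasible, i.e. if HTTP 1.1 is in use) to c's request method, with 'HEAD' as the first argument and the URL you are checking as the second (without the host part, of course ;-).

After each request, call c.getresponse() to get an HTTPResponse object, whose status attribute tells you whether the URL is still valid.

Yes, it is a bit low-level, but for exactly that reason it lets you optimize the task much better with just a little knowledge of HTTP ;-).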
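The steps above can be sketched roughly as follows, using http.client (the Python 3 name of httplib). The helper names head_status and expired_paths are hypothetical, and treating only 404 and 410 as "expired" is an assumption; a given site may signal removed videos differently, for example with a redirect.

```python
import http.client


def head_status(conn, path):
    """Send a HEAD request on an open connection and return the status code."""
    conn.request("HEAD", path)
    resp = conn.getresponse()
    resp.read()  # HEAD has no body; draining lets the connection be reused
    return resp.status


def expired_paths(conn, paths, gone_statuses=(404, 410)):
    """Return the paths whose HEAD status marks them as gone (assumed 404/410)."""
    return [p for p in paths if head_status(conn, p) in gone_statuses]


# Usage (not executed here; host and paths are placeholders):
# c = http.client.HTTPConnection("www.example.com", timeout=10)
# print(expired_paths(c, ["/video/1", "/video/2"]))
# c.close()
```

One connection object is shared across all the paths, so with an HTTP 1.1 server the checks can go over a single kept-alive socket, which is the performance point made above.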

0

Source: https://habr.com/ru/post/1728214/

