Python ETag / Last-Modified not working; how to get the latest RSS?

I am trying to write a Python program that captures and displays any RSS updates since the program was last run. I am using feedparser and trying to use ETags and Last-Modified as described here on SO, but my test script does not seem to work.

    import feedparser

    rsslist=["http://skottieyoung.tumblr.com/rss","http://mrjakeparker.com/feed/"]

    for feed in rsslist:
        print('--------'+feed+'-------')
        d=feedparser.parse(feed)
        print(len(d.entries))
        if (len(d.entries) > 0):
            etag=d.feed.get('etag','')
            modified=d.get('modified',d.get('updated',d.entries[0].get('published','no modified,update or published fields present in rss')))
            d2=feedparser.parse(feed,modified)
            if (len(d2.entries) > 0):
                etag2=d2.feed.get('etag','')
                modified2=d2.get('updated',d.entries[0].get('published',''))
            if (d2==d): #ideally we would never see this bc etags/last modified would prevent unnecessarily downloading what we all ready have.
                print("Arrg these are the same")

I'm honestly not sure whether RSS/XML technology has changed since the references I found online were written, or whether there is a problem with my code.

Either way, I'm looking for the best way to consume RSS feeds efficiently. Essentially, I am trying to minimize bandwidth usage, for example by using the Last-Modified and ETag fields.

Thanks in advance.

2 answers

Your problem is that you are passing the last-modified date where the etag is expected. etag is the second parameter of parse(); modified is the third.

Instead of:

    d2=feedparser.parse(feed,modified)

use:

    d2=feedparser.parse(feed,modified=modified)
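To make the difference concrete, here is a minimal sketch using a stand-in function. `parse_stub` is hypothetical and only mirrors the order of the first three parameters of parse() as described above; it makes no network request.

```python
def parse_stub(url, etag=None, modified=None):
    """Hypothetical stand-in that mirrors the leading parameter order of parse()."""
    return {"etag": etag, "modified": modified}


stamp = "Wed, 07 Nov 2012 19:27:48 GMT"

# Passed positionally, the date is bound to the second parameter, etag.
positional = parse_stub("http://example.com/feed", stamp)

# Passed by keyword, the date is bound to modified as intended.
keyword = parse_stub("http://example.com/feed", modified=stamp)

print(positional)
print(keyword)
```

In the positional call the date ends up being treated as an ETag, so the server never sees it as a last-modified value.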

From reading the source code, it seems that passing etag or modified to parse() only sends the corresponding request headers (If-None-Match and If-Modified-Since), so that the server can return an empty response if nothing has changed. If the server does not support this, it simply returns the full RSS feed. I would modify your code to check the date of each entry and ignore any entry whose date is not later than the maximum date from the previous request:

    import feedparser

    rsslist = ["http://skottieyoung.tumblr.com/rss", "http://mrjakeparker.com/feed/"]


    def feed_modified_date(feed):
        # This is the Last-Modified value from the response headers. Do not
        # confuse it with the times inside the feed itself: the server may use
        # a different timezone for Last-Modified headers than it uses for the
        # publish dates.
        return feed.get('modified')


    def max_entry_date(feed):
        entry_pub_dates = (e.get('published_parsed') for e in feed.entries)
        entry_pub_dates = tuple(e for e in entry_pub_dates if e is not None)
        if len(entry_pub_dates) > 0:
            return max(entry_pub_dates)
        return None


    def entries_with_dates_after(feed, date):
        response = []
        for entry in feed.entries:
            published = entry.get('published_parsed')
            # Skip undated entries; with no previous date, keep every dated entry.
            if published is not None and (date is None or published > date):
                response.append(entry)
        return response


    for feed_url in rsslist:
        print('--------%s-------' % feed_url)
        d = feedparser.parse(feed_url)
        print('feed length %i' % len(d.entries))
        if len(d.entries) > 0:
            # The ETag is stored on the result object itself, not on d.feed.
            etag = d.get('etag')
            modified = feed_modified_date(d)
            print('modified at %s' % modified)

            d2 = feedparser.parse(feed_url, etag=etag, modified=modified)
            print('second feed length %i' % len(d2.entries))
            if len(d2.entries) > 0:
                print("server does not support etags or there are new entries")
                # perhaps the server does not support etags or last-modified
                # filter entries ourself
                prev_max_date = max_entry_date(d)
                entries = entries_with_dates_after(d2, prev_max_date)
                print('%i new entries' % len(entries))
            else:
                print('there are no entries')

When I ran it, this gave:

    --------http://skottieyoung.tumblr.com/rss-------
    feed length 20
    modified at None
    second feed length 20
    server does not support etags or there are new entries
    0 new entries
    --------http://mrjakeparker.com/feed/-------
    feed length 10
    modified at Wed, 07 Nov 2012 19:27:48 GMT
    second feed length 0
    there are no entries
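Since the question is about picking up updates between runs, the validators need to survive until the next run. Below is a rough sketch of one way to do that, not a definitive implementation. The helper names (`load_state`, `save_state`, `fetch_if_changed`), the JSON state file, and the `parse` hook are all assumptions made for illustration. It relies on the parse result exposing `status`, `etag`, and `modified`, and it omits error handling for failed requests.

```python
import json
import os


def load_state(path):
    """Return the saved {url: {"etag": ..., "modified": ...}} mapping, or {}."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)


def save_state(path, state):
    """Write the validator mapping to disk for the next run."""
    with open(path, "w") as fh:
        json.dump(state, fh)


def fetch_if_changed(url, state, parse=None):
    """Fetch url with the saved validators; return the result, or None on 304."""
    if parse is None:
        # Imported lazily so the helper can also be driven by a stub parser.
        import feedparser
        parse = feedparser.parse
    saved = state.get(url, {})
    result = parse(url, etag=saved.get("etag"), modified=saved.get("modified"))
    if result.get("status") == 304:
        # The server confirmed nothing changed; keep the old validators.
        return None
    # Remember whatever validators the server sent back this time.
    state[url] = {"etag": result.get("etag"), "modified": result.get("modified")}
    return result
```

A run would then call `load_state("feeds.json")` (a hypothetical file name), call `fetch_if_changed` for each URL, and finish with `save_state`. For servers that ignore both headers, the date filtering shown above is still needed.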

I would suggest using the Date response header as a fallback if the feed provides no etag or modified information.

Use feed['headers']['Date'], where feed is the result of an earlier parse() call, like this:

feedparser.parse(url, modified=feed['headers']['Date'])

Edit: However, it seems that some servers ignore the modified parameter.
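The fallback suggested in this answer could be wrapped in a small helper. This is only a sketch: `pick_modified` is a hypothetical name, and it assumes the parse result behaves like a dict with an optional `modified` value and a `headers` mapping. The header name is matched case-insensitively because its casing may vary.

```python
def pick_modified(result):
    """Return the best available value for the next request's modified argument."""
    modified = result.get("modified")
    if modified:
        return modified
    headers = result.get("headers") or {}
    # Fall back to the Date response header, whatever its casing.
    for name, value in headers.items():
        if name.lower() == "date":
            return value
    return None
```

It would be used as `feedparser.parse(url, modified=pick_modified(previous))`, where `previous` is the result of the earlier request.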


Source: https://habr.com/ru/post/1444920/

