Sorry, nothing like this exists in Python, although there are equivalents in PHP. You are more than welcome to use and improve the one I created, called scraped. It does not cover all sites: it is a recipe-based system that currently only processes NYT, WSJ, and the Economist. I am working on a comprehensive algorithm, but that is a serious undertaking, since it involves a ton of analysis of different kinds of HTML and XML. Even the three sites mentioned above need completely different algorithms to clean up their pages, and the one for WSJ is the most complex to date. They clutter their HTML with useless crap, mostly just to stop you.
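To give a rough idea of what "recipe-based" means here, below is a minimal sketch using only the standard library. It is not the actual code of scraped: the `RECIPES` table, the `extract` and `clean` helpers, and the tag/class pairs are all hypothetical, made up for illustration, and the real sites' markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextInTag(HTMLParser):
    """Collect text inside elements with a given tag and CSS class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0    # > 0 while we are inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we know when we leave.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract(html, tag, class_name):
    """Return the text found inside the first-level matching elements."""
    parser = TextInTag(tag, class_name)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# One recipe per site. The selectors are assumptions, not the sites' real markup.
RECIPES = {
    "nytimes.com": lambda html: extract(html, "div", "article-body"),
    "wsj.com": lambda html: extract(html, "div", "story-text"),
    "economist.com": lambda html: extract(html, "div", "main-content"),
}


def clean(url, html):
    """Pick the recipe for the URL's domain and apply it to the page."""
    host = urlparse(url).hostname or ""
    for domain, recipe in RECIPES.items():
        if host == domain or host.endswith("." + domain):
            return recipe(html)
    raise LookupError(f"no recipe for {host!r}")
```

Each site gets its own function, so adding a site means adding one entry to the table, and a site with messy markup can have a recipe as elaborate as it needs without affecting the others.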
See the readme. The libraries involved are lxml, BeautifulSoup, and feedparser; the feed format is XML RSS 2.0.
http://tinyurl.com/yh3s9pa