How to parse XML feed using python?

Question

I am trying to parse this XML feed (http://www.reddit.com/r/videos/top/.rss) and save the YouTube link from each item, but I cannot get past the channel node that contains the items. How do I reach that level so that I can iterate over the item elements?

    #reddit parse
    reddit_file = urllib2.urlopen('http://www.reddit.com/r/videos/top/.rss')
    #convert to string:
    reddit_data = reddit_file.read()
    #close file because we dont need it anymore:
    reddit_file.close()

    #entire feed
    reddit_root = etree.fromstring(reddit_data)
    channel = reddit_root.findall('{http://purl.org/dc/elements/1.1/}channel')
    print channel

    reddit_feed=[]
    for entry in channel:
        #get description, url, and thumbnail
        desc = #not sure how to get this
        reddit_feed.append([desc])

+4

python xml parsing

sharataka Oct 14 '12 at 2:51

2 answers

I wrote this for you using XPath expressions (tested successfully):

    from lxml import etree
    import urllib2

    headers = { 'User-Agent' : 'Mozilla/5.0' }
    req = urllib2.Request('http://www.reddit.com/r/videos/top/.rss', None, headers)
    reddit_file = urllib2.urlopen(req).read()

    reddit = etree.fromstring(reddit_file)

    for item in reddit.xpath('/rss/channel/item'):
        print "title =", item.xpath("./title/text()")[0]
        print "description =", item.xpath("./description/text()")[0]
        print "thumbnail =", item.xpath("./*[local-name()='thumbnail']/@url")[0]
        print "link =", item.xpath("./link/text()")[0]
        print "-" * 100
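The local-name() test above sidesteps the namespace on the thumbnail element. A minimal sketch of the same lookup with an explicit namespace map follows, written for Python 3 and the standard library only. It assumes the thumbnail lives in the Media RSS namespace (http://search.yahoo.com/mrss/); SAMPLE_FEED, NAMESPACES, and thumbnails() are made-up names for illustration, and the sample data is not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the downloaded feed.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>sample channel</title>
    <item>
      <title>First video</title>
      <link>http://example.com/comments/1</link>
      <media:thumbnail url="http://example.com/thumb1.jpg"/>
    </item>
    <item>
      <title>Second video</title>
      <link>http://example.com/comments/2</link>
    </item>
  </channel>
</rss>"""

# Assumed prefix-to-URI mapping; check the xmlns declarations of the real feed.
NAMESPACES = {"media": "http://search.yahoo.com/mrss/"}


def thumbnails(xml_text):
    """Return a (title, thumbnail_url) pair for every item in the feed."""
    root = ET.fromstring(xml_text)
    result = []
    # channel and item carry no namespace, so a plain path reaches them.
    for item in root.findall("channel/item"):
        thumb = item.find("media:thumbnail", NAMESPACES)
        # Items without a thumbnail get None instead of raising an error.
        url = thumb.get("url") if thumb is not None else None
        result.append((item.findtext("title"), url))
    return result


if __name__ == "__main__":
    for title, url in thumbnails(SAMPLE_FEED):
        print(title, url)
```

Unlike indexing with [0], this version tolerates items that have no thumbnail.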

+3

Gilles Quenot Oct 14 '12 at 3:41

Himanshu · Accepted Answer · 2012-10-14T03:41:57+0000

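Try findall('channel/item'). In an RSS 2.0 feed the channel and item elements are not in any namespace, so the {http://purl.org/dc/elements/1.1/} prefix in your findall call matches nothing. The items are children of channel, which is itself a child of the root rss element.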

    import urllib2
    from xml.etree import ElementTree as etree

    #reddit parse
    reddit_file = urllib2.urlopen('http://www.reddit.com/r/videos/top/.rss')
    #convert to string:
    reddit_data = reddit_file.read()
    print reddit_data
    #close file because we dont need it anymore:
    reddit_file.close()

    #entire feed
    reddit_root = etree.fromstring(reddit_data)
    item = reddit_root.findall('channel/item')
    print item

    reddit_feed=[]
    for entry in item:
        #get description, url, and thumbnail
        desc = entry.findtext('description')
        reddit_feed.append([desc])
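Since the question asks for the YouTube links, here is a rough Python 3 sketch that builds on the same findall('channel/item') loop. It assumes each description holds escaped HTML with href attributes, which is a guess about the feed's layout rather than something confirmed above. SAMPLE_FEED, HREF_RE, and youtube_links() are hypothetical names, and the sample data is invented so the snippet runs without network access.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample data; the description holds escaped HTML, as assumed above.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>sample channel</title>
    <item>
      <title>First video</title>
      <description>&lt;a href="http://www.youtube.com/watch?v=abc123"&gt;[link]&lt;/a&gt; &lt;a href="http://example.com/comments/1"&gt;[comments]&lt;/a&gt;</description>
    </item>
    <item>
      <title>No video here</title>
      <description>&lt;a href="http://example.com/page"&gt;[link]&lt;/a&gt;</description>
    </item>
  </channel>
</rss>"""

# Simple pattern for double-quoted href values; not a full HTML parser.
HREF_RE = re.compile(r'href="([^"]+)"')


def youtube_links(xml_text):
    """Return a (title, [youtube urls]) pair for every item in the feed."""
    root = ET.fromstring(xml_text)
    feed = []
    for entry in root.findall("channel/item"):
        # findtext returns the unescaped description, or "" if it is missing.
        desc = entry.findtext("description", default="")
        links = [
            url for url in HREF_RE.findall(desc)
            if "youtube.com" in url or "youtu.be" in url
        ]
        feed.append((entry.findtext("title"), links))
    return feed


if __name__ == "__main__":
    for title, links in youtube_links(SAMPLE_FEED):
        print(title, links)
```

To run this against the live feed, the downloaded bytes can be passed to youtube_links() in place of SAMPLE_FEED; in Python 3 the download would go through urllib.request rather than urllib2.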
