Python "re" module not working?

Question

I use the Python "re" module as follows:

 request = get("http://www.allmusic.com/album/warning-mw0000106792")
 print re.findall('<hgroup>(.*?)</hgroup>', request)

All I do is get the HTML for this site and look for this piece of code:

 <hgroup>
     <h3 class="album-artist">
         <a href="http://www.allmusic.com/artist/green-day-mn0000154544">Green Day</a>
     </h3>
     <h2 class="album-title">
         Warning
     </h2>
 </hgroup>

However, it keeps printing an empty list. Why is this? Why does re.findall not find this fragment?

+6

python string get

Cisplatin Jul 21 '13 at 20:38

2 answers

The re module is not broken. You are probably running into the fact that not all HTML can be easily matched with simple regular expressions.

Instead, try parsing the HTML with an actual HTML parser, for example BeautifulSoup:

 from BeautifulSoup import BeautifulSoup
 from requests import get
 request = get("http://www.allmusic.com/album/warning-mw0000106792")
 soup = BeautifulSoup(request.content)
 print soup.findAll('hgroup')

Or, alternatively, pyquery:

 from pyquery import PyQuery as pq
 d = pq(url='http://www.allmusic.com/album/warning-mw0000106792')
 print d('hgroup')

+6

jsalonen Jul 21 '13 at 20:41

Nolen Royalty · Accepted Answer · 2013-07-21T20:41:31+0000

The HTML you are processing spans several lines. You need to pass the re.DOTALL flag to findall, like this:

 print re.findall('<hgroup>(.*?)</hgroup>', request, re.DOTALL)

This allows . to match newlines, so the call returns the expected output.
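To see the difference without fetching the page, here is a minimal offline sketch. It assumes the page source is already in a string; the `html` variable is a hand-copied version of the fragment from the question, not the live page, and its line breaks are an assumption about how the real markup is laid out. `print()` is called with a single argument so the snippet should run on both Python 2 and Python 3.

```python
import re

# Hand-copied fragment from the question, spread over several lines
# (an assumption about the real page's layout).
html = """<hgroup>
    <h3 class="album-artist">
        <a href="http://www.allmusic.com/artist/green-day-mn0000154544">Green Day</a>
    </h3>
    <h2 class="album-title">
        Warning
    </h2>
</hgroup>"""

pattern = '<hgroup>(.*?)</hgroup>'

# Without DOTALL, "." does not match "\n", so the pattern cannot
# get past the newline right after <hgroup>.
without_flag = re.findall(pattern, html)

# With DOTALL, "." also matches newlines, so the whole block is captured.
with_flag = re.findall(pattern, html, re.DOTALL)

print(without_flag)    # prints []
print(len(with_flag))  # prints 1
```

The same flag can also be written inline by starting the pattern with `(?s)`, which is handy when the pattern is passed to code that does not expose a flags argument.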

@jsalonen is right, of course, that parsing HTML with regular expressions is a daunting task. However, for small cases such as a one-off script, I would say it is acceptable.
