Python "re" module not working?

I use the Python "re" module as follows:

    request = get("http://www.allmusic.com/album/warning-mw0000106792")
    print re.findall('<hgroup>(.*?)</hgroup>', request)

All I do is get the HTML for this site and look for this piece of code:

    <hgroup>
        <h3 class="album-artist">
            <a href="http://www.allmusic.com/artist/green-day-mn0000154544">Green Day</a>
        </h3>
        <h2 class="album-title"> Warning </h2>
    </hgroup>

However, it keeps printing an empty list. Why is this? Why does re.findall not find this fragment?

2 answers

The HTML you are processing spans several lines, and by default . does not match a newline character. You must pass the re.DOTALL flag to findall as follows:

 print re.findall('<hgroup>(.*?)</hgroup>', request, re.DOTALL) 

This flag allows . to match newlines as well, so the pattern can span several lines and findall returns the expected output.
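To see the effect in isolation, here is a minimal sketch that runs findall on an inline multi-line string instead of the downloaded page. The `html` variable and its line breaks are assumptions for illustration, and the sketch uses Python 3 `print()` syntax rather than the Python 2 statements shown above.

```python
import re

# Stand-in for the downloaded page; assumed to span several lines like the real markup.
html = """<hgroup>
<h3 class="album-artist">
<a href="http://www.allmusic.com/artist/green-day-mn0000154544">Green Day</a>
</h3>
<h2 class="album-title"> Warning </h2>
</hgroup>"""

pattern = '<hgroup>(.*?)</hgroup>'

# Without re.DOTALL, "." stops at the first newline, so nothing matches.
without_flag = re.findall(pattern, html)

# With re.DOTALL, "." also matches newlines, so the whole block is captured.
with_flag = re.findall(pattern, html, re.DOTALL)

print(without_flag)
print(len(with_flag))
```

The same behaviour can be requested inside the pattern itself with the inline flag `(?s)` placed at the start of the expression.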

@jsalonen is right, of course, that parsing HTML with regular expressions is a daunting task. However, in small cases, such as a one-off script, I would say it is acceptable.


The re module is not broken. You are probably running into the fact that not all HTML can be easily matched with simple regular expressions.

Instead, try parsing the HTML with an actual HTML parser, for example BeautifulSoup:

    from BeautifulSoup import BeautifulSoup
    from requests import get

    request = get("http://www.allmusic.com/album/warning-mw0000106792")
    soup = BeautifulSoup(request.content)
    print soup.findAll('hgroup')

Or, alternatively, pyquery:

    from pyquery import PyQuery as pq

    d = pq(url='http://www.allmusic.com/album/warning-mw0000106792')
    print d('hgroup')
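If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module. This is a swapped-in technique, not the BeautifulSoup or pyquery approach above. The `HgroupText` class and the inline `html` string are hypothetical names introduced for illustration, and the sketch uses Python 3 syntax.

```python
from html.parser import HTMLParser


class HgroupText(HTMLParser):
    """Hypothetical helper: collects the text found inside <hgroup> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside an <hgroup>
        self.chunks = []  # non-empty text pieces seen inside <hgroup>

    def handle_starttag(self, tag, attrs):
        if tag == "hgroup":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "hgroup" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


# Inline stand-in for the downloaded page (the fragment from the question).
html = (
    '<hgroup> <h3 class="album-artist"> '
    '<a href="http://www.allmusic.com/artist/green-day-mn0000154544">Green Day</a> '
    '</h3> <h2 class="album-title"> Warning </h2> </hgroup>'
)

parser = HgroupText()
parser.feed(html)
parser.close()
print(parser.chunks)
```

Because the parser works on tags rather than on raw characters, it does not matter whether the fragment sits on one line or is spread over several.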

Source: https://habr.com/ru/post/949948/

