Python: store many regular matches in a tuple?

Question


I am trying to write a simple Python-based HTML parser using regular expressions. My problem is getting a regex search to find all possible matches and then store them in a tuple.

Say I have a page with the following stored in the variable HTMLtext:

 <ul>
 <li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
 <li><b><a href="/blog/about">About Me!</a></b></li>
 <li><b><a href="/blog/music">Audio Production</a></b></li>
 <li><b><a href="/blog/photos">Gallery</a></b></li>
 <li><b><a href="/blog/stuff">Misc</a></b></li>
 <li><b><a href="/blog/contact">Shoot me an email</a></b></li>
 </ul>

I want to run a regular expression search over this text and return a tuple containing the last path segment of each link's URL. So, I would like to get something like this:

 pages = ["home", "about", "music", "photos", "stuff", "contact"]

So far, I have managed to use a regex to find a single result:

 pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]

Running this expression gives pages = ['home'].

How can I get the regular expression search to continue through the entire text, appending each match to this tuple?

(Note: I know that I probably should NOT use regular expressions to parse HTML. But I want to know how to do this anyway.)

+4

python html regex parsing

hao_maike Mar 24 '12 at 20:28


5 answers

Use the re module's findall function:

 pages = re.findall('<a href="/blog/([^"]*)">', HTMLtext)
 print(pages)

Output:

 ['home', 'about', 'music', 'photos', 'stuff', 'contact']
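Note that findall returns a list, while the question asks for a tuple. A minimal sketch of the conversion, assuming an abbreviated two-link sample in place of the question's full HTMLtext:

```python
import re

# Abbreviated sample input (an assumption; the question's page has six links).
HTMLtext = (
    '<li><b><a href="/blog/home">Back to the index</a></b></li> '
    '<li><b><a href="/blog/about">About Me!</a></b></li>'
)

# findall returns a list of the captured groups; tuple() converts it.
pages = tuple(re.findall(r'<a href="/blog/([^"]*)">', HTMLtext))
print(pages)  # ('home', 'about')
```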

+2

ovgolovin Mar 24 '12 at 20:34


Use findall instead of search:

 >>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
 >>> pages
 ['home', 'about', 'music', 'photos', 'stuff', 'contact']

+1

Simeon Visser Mar 24 '12 at 20:33


re.findall() and re.finditer() are used to find multiple matches.
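As a rough illustration of the difference: findall returns the captured strings directly, while finditer yields match objects, which is useful when you also want positions. The sample input and the name href_re below are assumptions for the sketch:

```python
import re

# Abbreviated sample input (an assumption, not the question's full page).
HTMLtext = (
    '<li><a href="/blog/home">Back to the index</a></li>\n'
    '<li><a href="/blog/about">About Me!</a></li>\n'
)
href_re = re.compile(r'<a href="/blog/([^"]*)">')

# finditer yields one match object per hit, lazily.
matches = [(m.group(1), m.start()) for m in href_re.finditer(HTMLtext)]

# Keep only the captured names, as findall would return them.
names = [name for name, _ in matches]
print(names)  # ['home', 'about']
```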

+1

Raymond Hettinger Mar 24 '12 at 20:35


To find all results, use findall(). You also only need to compile the regex once; after that you can reuse it.

 href_re = re.compile('<a href="/blog/(.*)">')  # Compile the regexp once
 pages = href_re.findall(HTMLtext)  # Find all matches - ["home", "about",

+1

Mariusz Jamro Mar 24 '12 at 20:36


tchrist · Accepted Answer · 2012-03-24T20:55:58+0000

Your pattern will not work on all inputs, including yours. .* is too greedy (technically, it finds a maximal match), so the match runs from the first href all the way to the last matching close. Two easy ways to fix this are to use either a minimal match or a negated character class.

 # minimal match approach
 pages = re.findall(r'<a\s+href="/blog/(.+?)">', full_html_text, re.I + re.S)
 # negated charclass approach
 pages = re.findall(r'<a\s+href="/blog/([^"]+)">', full_html_text, re.I)
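To see the greediness problem concretely, here is a small sketch with two links on a single line, so that . is never stopped by a newline. The variable names and the sample string are assumptions for illustration:

```python
import re

# Two links on ONE line, so '.' never runs into a newline.
html = '<a href="/blog/home">Home</a> <a href="/blog/about">About</a>'

# Greedy: .* runs on to the LAST '">' in the string.
greedy = re.findall(r'<a href="/blog/(.*)">', html)

# Minimal match: .+? stops at the first '">' it can.
lazy = re.findall(r'<a href="/blog/(.+?)">', html)

# Negated character class: cannot cross a double quote at all.
negated = re.findall(r'<a href="/blog/([^"]+)">', html)

print(greedy)   # ['home">Home</a> <a href="/blog/about']
print(lazy)     # ['home', 'about']
print(negated)  # ['home', 'about']
```

The greedy pattern only appears to work when every link sits on its own line, because . does not match a newline unless re.S is given.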

Mandatory Warning

For simple and fairly strict text, regular expressions are just fine; after all, that's why we use regex search-and-replace in our text editors when editing HTML! However, it gets harder the less you know about the input, for example:

if there is another attribute between <a and href, for example <a title="foo" href="bar">
casing issues such as <A HREF='foo'>
whitespace issues
alternative quotes like href='/foo/bar' instead of href="/foo/bar"
embedded HTML comments

This is not an exhaustive list of issues; there are others. Using regular expressions on HTML is thus possible, but whether it is appropriate depends on too many other factors to judge.

However, from the small example you showed, it looks perfectly fine for your own case. You just need to tidy up your pattern and call the right method.
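For comparison, when the input is less predictable, one alternative to a regex is the standard library's html.parser module. This is a different technique from the one in this answer, sketched here only to show how it copes with the issues listed above. The class name BlogLinkCollector and the sample markup are hypothetical:

```python
from html.parser import HTMLParser


class BlogLinkCollector(HTMLParser):  # hypothetical helper, not from the thread
    """Collect the part of each <a href> that follows a given prefix."""

    def __init__(self, prefix="/blog/"):
        super().__init__()
        self.prefix = prefix
        self.pages = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(self.prefix):
                self.pages.append(value[len(self.prefix):])


collector = BlogLinkCollector()
# Sample input with uppercase tags, single quotes, an extra attribute,
# and a link hidden inside an HTML comment.
collector.feed(
    "<A HREF='/blog/home'>Home</A>\n"
    '<a title="foo" href="/blog/about">About</a>\n'
    '<!-- <a href="/blog/hidden">x</a> -->'
)
print(collector.pages)  # ['home', 'about']
```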
