I am trying to make a simple Python-based HTML parser using regular expressions. My problem is getting a regex search to find all possible matches and then save them in a list.
Say I have a page with the following stored in an HTMLtext variable:
<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>
I want to run a regular expression search over this text and return a list containing the last path segment of each link's URL. So, I would like to get something like this:
pages = ["home", "about", "music", "photos", "stuff", "contact"]
So far, I could use regex to search for a single result:
pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]
Running this expression gives pages = ['home'].
How can I get the regular expression search to continue through the entire text, adding each match to this list?
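To make the behaviour reproducible, here is a minimal self-contained sketch of the single-match attempt. It assumes the menu is stored with one link per line and shortens it to three links for brevity; the names pattern and match are only illustrative.

```python
import re

# Assumed layout: one <li> per line, shortened to three links for brevity.
HTMLtext = """<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
</ul>"""

pattern = re.compile('<a href="/blog/(.*)">')

# search() stops at the first place the pattern matches and returns
# a single match object, so only the first link is ever captured.
match = pattern.search(HTMLtext)
pages = [match.group(1)]

print(pages)
```

Only the first link ends up in pages, because search() never looks past the first match.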
(Note: I know that I probably should NOT use regular expressions to parse HTML. But I want to know how to do this anyway.)
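For what it is worth, one possible direction is re.findall, which returns every non-overlapping match instead of just the first. The sketch below is only an illustration under assumptions: it swaps the greedy (.*) for [^"]* so that a capture cannot run past the closing quote even when all links sit on one line, and it assumes every href starts with /blog/.

```python
import re

# Same menu as above, written on a single line to show that the
# pattern does not rely on line breaks between the links.
HTMLtext = (
    '<ul> <li class="active"><b><a href="/blog/home">Back to the index</a></b></li> '
    '<li><b><a href="/blog/about">About Me!</a></b></li> '
    '<li><b><a href="/blog/music">Audio Production</a></b></li> '
    '<li><b><a href="/blog/photos">Gallery</a></b></li> '
    '<li><b><a href="/blog/stuff">Misc</a></b></li> '
    '<li><b><a href="/blog/contact">Shoot me an email</a></b></li> </ul>'
)

# With exactly one capturing group, findall() returns a list of the
# strings captured by that group, one per match, in document order.
pages = re.findall(r'<a href="/blog/([^"]*)">', HTMLtext)

# Convert to a tuple only if an immutable sequence is really wanted.
pages_tuple = tuple(pages)

print(pages)
```

If match positions or several groups are needed, re.finditer yields match objects in the same order and could be used in place of findall.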