Python: store multiple regex matches in a tuple?

I am trying to make a simple Python based HTML parser using regular expressions. My problem is to get a regex search query to find all possible matches and then save them in a tuple.

Say I have a page with the following stored in an HTMLtext variable:

 <ul>
   <li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
   <li><b><a href="/blog/about">About Me!</a></b></li>
   <li><b><a href="/blog/music">Audio Production</a></b></li>
   <li><b><a href="/blog/photos">Gallery</a></b></li>
   <li><b><a href="/blog/stuff">Misc</a></b></li>
   <li><b><a href="/blog/contact">Shoot me an email</a></b></li>
 </ul>

I want to run a regular expression search over this text and return a tuple containing the last path segment of each link's URL. So, I would like to get something like this:

 pages = ["home", "about", "music", "photos", "stuff", "contact"] 

So far, I could use regex to search for a single result:

 pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)] 

Running this expression gives pages = ['home'].

How can I get the regular expression search to continue through the entire text, adding each match to this tuple?

(Note: I know that I probably should NOT use regular expressions to parse HTML. But I want to know how to do this anyway.)

5 answers

Your pattern will not work on all inputs. .* is greedy (technically, it finds the maximal match), so as soon as two links share a line (or re.S is in effect) it captures everything from the first href through the last matching close. Two easy ways to fix this are to use either a minimal (non-greedy) match or a negated character class.

 # minimal match approach
 pages = re.findall(r'<a\s+href="/blog/(.+?)">', full_html_text, re.I + re.S)

 # negated charclass approach
 pages = re.findall(r'<a\s+href="/blog/([^"]+)">', full_html_text, re.I)
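To make the difference concrete, here is a small sketch that runs the greedy pattern and both fixes on the same input. The one-line `html` sample is made up for illustration (it is not the asker's full page); it puts two links on one line so the greedy behaviour shows up.

```python
import re

# Hypothetical sample: two links on the SAME line, so a greedy ".*" can overshoot.
html = '<li><a href="/blog/home">Home</a></li> <li><a href="/blog/about">About</a></li>'

# Greedy: ".*" runs to the LAST '">' it can reach on the line.
greedy = re.findall(r'<a\s+href="/blog/(.*)">', html)

# Minimal match: ".+?" stops at the FIRST '">' after each href.
lazy = re.findall(r'<a\s+href="/blog/(.+?)">', html)

# Negated character class: the capture can never cross a double quote.
negated = re.findall(r'<a\s+href="/blog/([^"]+)">', html)

print(greedy)   # a single, overlong match
print(lazy)     # ['home', 'about']
print(negated)  # ['home', 'about']
```

The negated character class is usually the safer of the two fixes, since what it can capture does not depend on flags such as re.S.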

Obligatory Warning

For simple and fairly constrained text, regular expressions are just fine; after all, that's why we use regex search-and-replace in our text editors when editing HTML! However, it gets harder the less you know about the input, for example:

  • if there is another attribute between <a and href, for example <a title="foo" href="bar">
  • casing issues such as <A HREF='foo'>
  • whitespace issues
  • alternative quotes like href='/foo/bar' instead of href="/foo/bar"
  • embedded HTML comments

This is not an exhaustive list of issues; there are others. So using regular expressions on HTML is possible, but whether it is appropriate depends on too many other factors to judge here.
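For comparison, here is a minimal sketch of the non-regex route using the standard library's html.parser module, which copes with several of the issues listed above (extra attributes, upper-case tags, single quotes, comments). The class name `BlogLinkCollector` and the `sample` markup are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class BlogLinkCollector(HTMLParser):
    """Hypothetical helper: collects the part after /blog/ for every such link."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF='...'> is handled.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("/blog/"):
            self.pages.append(href[len("/blog/"):])


# Made-up sample covering an extra attribute, upper case, single quotes, and a comment.
sample = """<ul>
<li><a title="foo" href="/blog/home">Index</a></li>
<li><A HREF='/blog/about'>About</A></li>
<!-- <a href="/blog/hidden">commented out</a> -->
</ul>"""

collector = BlogLinkCollector()
collector.feed(sample)
print(collector.pages)  # ['home', 'about']
```

The link inside the comment is skipped because the parser reports comments separately rather than as start tags.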

However, judging from the small example you showed, a regex looks perfectly fine for your own case. You just need to tighten up your pattern and call the right method.


Use the findall function of the re module:

 pages = re.findall('<a href="/blog/([^"]*)">', HTMLtext)
 print(pages)

Output:

 ['home', 'about', 'music', 'photos', 'stuff', 'contact'] 

Use findall instead of search:

 >>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
 >>> pages
 ['home', 'about', 'music', 'photos', 'stuff', 'contact']

re.findall() and re.finditer() are used to search for multiple matches.
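As a rough illustration of how the two differ: findall returns the captured strings directly, while finditer yields match objects, which is handy when you also want positions. The `html_text` sample below is a shortened, made-up version of the list from the question.

```python
import re

# Shortened, made-up version of the markup from the question.
html_text = '''<li><a href="/blog/home">Back to the index</a></li>
<li><a href="/blog/about">About Me!</a></li>'''

pattern = re.compile(r'<a href="/blog/([^"]*)">')

# findall: with one capture group, returns a list of the captured strings.
all_at_once = pattern.findall(html_text)

# finditer: yields match objects, so group(1) and start() are both available.
one_by_one = [(m.group(1), m.start()) for m in pattern.finditer(html_text)]

print(all_at_once)  # ['home', 'about']
print(one_by_one)   # (name, offset of the match in html_text) pairs
```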


To find all results, use findall(). You also only need to compile the regex once, and then you can reuse it.

 href_re = re.compile('<a href="/blog/(.*)">')  # Compile the regexp once
 pages = href_re.findall(HTMLtext)  # Find all matches - ["home", "about",
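Since the question asks for a tuple while findall returns a list, here is a small sketch that reuses one compiled pattern across several inputs and converts each result with tuple(). The helper name `blog_pages` and the two input strings are hypothetical, and the negated character class is swapped in for `.*` so that several links on one line do not merge into a single match.

```python
import re

# Compiled once at module level and reused for every call.
HREF_RE = re.compile(r'<a href="/blog/([^"]*)">')


def blog_pages(html_text):
    """Hypothetical helper: return the matched page names as a tuple."""
    return tuple(HREF_RE.findall(html_text))


first = blog_pages('<a href="/blog/home">Home</a> <a href="/blog/music">Audio</a>')
second = blog_pages('<p>no links here</p>')

print(first)   # ('home', 'music')
print(second)  # ()
```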

Source: https://habr.com/ru/post/1403263/

