Efficient way to extract text from tags
Suppose I have something like this:
    var = '<li> <a href="/...html">Energy</a> <ul> <li> <a href="/...html">Coal</a> </li> <li> <a href="/...html">Oil </a> </li> <li> <a href="/...html">Carbon</a> </li> <li> <a href="/...html">Oxygen</a> </li'

What is the best (most efficient) way to extract the text between the tags? Should I use a regex for this? My current method splits the string on the li tags and loops over the pieces with a for loop; I'm just wondering whether there is a faster way to do this.
You can use Beautiful Soup, which is very good for this kind of task. It is simple to use, easy to install, and well documented.
In your example, several li tags are not closed. I have corrected that below; here is how to get the text of every a tag:
    from bs4 import BeautifulSoup

    var = '''<li> <a href="/...html">Energy</a></li>
    <ul>
    <li><a href="/...html">Coal</a></li>
    <li><a href="/...html">Oil </a></li>
    <li><a href="/...html">Carbon</a></li>
    <li><a href="/...html">Oxygen</a></li>'''

    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(var, 'html.parser')

    # Print the text of every <a> tag
    for a in soup.find_all('a'):
        print(a.string)

This prints:
Energy
Coal
Oil
Carbon
Oxygen
For documentation and other examples, see the Beautiful Soup documentation.
The recommended way to extract information from a markup language is to use a parser; Beautiful Soup is a good choice. Avoid using regular expressions for this; they are not the right tool for the job!
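If installing a third-party package is not an option, the same parser-based idea can be sketched with the html.parser module from Python's standard library. This is only an illustration, not a replacement for Beautiful Soup: the names LinkTextExtractor and extract_link_texts are made up for this example, and the input is the string from the question, unclosed tags included.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect the text found inside <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_link = False  # True while we are between <a> and </a>
        self.texts = []        # raw text of each link, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current link.
        if self._in_link:
            self.texts[-1] += data


def extract_link_texts(markup):
    """Return the whitespace-stripped text of every <a> element in markup."""
    parser = LinkTextExtractor()
    parser.feed(markup)
    parser.close()
    return [text.strip() for text in parser.texts]


# The string from the question, with its unclosed and truncated tags.
var = ('<li> <a href="/...html">Energy</a> <ul> '
       '<li> <a href="/...html">Coal</a> </li> '
       '<li> <a href="/...html">Oil </a> </li> '
       '<li> <a href="/...html">Carbon</a> </li> '
       '<li> <a href="/...html">Oxygen</a> </li')

print(extract_link_texts(var))
# ['Energy', 'Coal', 'Oil', 'Carbon', 'Oxygen']
```

Because the extractor only tracks a tags, the missing closing li tags in the input do not affect the result.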
If you want to go down the regex route (some consider parsing HTML/XML with regex a sin), you can try something like this:
    import re

    re.findall('(?<=>)([^<]+)(?=</a>[^<]*</li)', var, re.S)

Personally, I think regex is great for one-off or simple use cases, but you need to be very careful when writing the pattern so that it is not unexpectedly greedy or otherwise surprising. For parsing complex documents, it is always better to use a module such as BeautifulSoup.
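As a small illustration of the care a pattern like this needs, the sketch below runs the same expression on the string exactly as it appears in the question. The names LINK_TEXT, original and matches are made up for this example.

```python
import re

# Same pattern as above: text preceded by '>' and followed by '</a>',
# optional non-'<' characters, and then '</li'.
LINK_TEXT = re.compile(r'(?<=>)([^<]+)(?=</a>[^<]*</li)', re.S)

# The string from the question, with its unclosed and truncated tags.
original = ('<li> <a href="/...html">Energy</a> <ul> '
            '<li> <a href="/...html">Coal</a> </li> '
            '<li> <a href="/...html">Oil </a> </li> '
            '<li> <a href="/...html">Carbon</a> </li> '
            '<li> <a href="/...html">Oxygen</a> </li')

matches = LINK_TEXT.findall(original)
print(matches)
# ['Coal', 'Oil ', 'Carbon', 'Oxygen']
# "Energy" is missing: its </a> is followed by <ul>, not by </li.
```

The pattern silently skips any link whose closing a tag is not followed by a closing li tag, and it keeps the trailing space in "Oil ", so the result depends on details of the markup that a parser would handle for you.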