Effective way to extract text from tags

Question


Suppose I have something like this:

var = '<li> <a href="/...html">Energy</a> <ul> <li> <a href="/...html">Coal</a> </li> <li> <a href="/...html">Oil </a> </li> <li> <a href="/...html">Carbon</a> </li> <li> <a href="/...html">Oxygen</a> </li'

What is the best (most efficient) way to extract the text between the tags? Should I use a regex for this? My current method splits the string on li tags and uses a for loop; I am just wondering whether there is a faster way to do this.

+4

python regex extract

Max kim Jun 19 '13 at 1:42


4 answers

The recommended way to extract information from a markup language is to use a parser; Beautiful Soup, for example, is a good choice. Avoid regular expressions for this; they are not the right tool for the job!
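A minimal sketch of the parser approach, assuming only Python 3's standard-library html.parser rather than Beautiful Soup. The names LinkTextCollector and extract_link_texts are hypothetical, chosen for illustration; the input is the string from the question.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collects the text found inside <a> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_link = False   # True while we are between <a> and </a>
        self._buffer = []       # text chunks seen inside the current <a>
        self.texts = []         # one stripped string per <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.texts.append("".join(self._buffer).strip())
            self._in_link = False


def extract_link_texts(markup):
    """Return the stripped text of every <a> element in markup."""
    parser = LinkTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.texts


# The (truncated) string from the question, copied as-is.
var = ('<li> <a href="/...html">Energy</a> <ul> '
       '<li> <a href="/...html">Coal</a> </li> '
       '<li> <a href="/...html">Oil </a> </li> '
       '<li> <a href="/...html">Carbon</a> </li> '
       '<li> <a href="/...html">Oxygen</a> </li')

link_texts = extract_link_texts(var)
print(link_texts)
```

The parser tolerates the unclosed tags in the input, so no manual clean-up of the string should be needed for this case.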

+6

Óscar López Jun 19 '13 at 1:46


If you want to go down the regex path (which some consider a sin when parsing HTML/XML), you can try something like this:

 import re
 re.findall(r'(?<=>)([^<]+)(?=</a>[^<]*</li)', var, re.S)

Personally, I think regex is great for one-off or simple use cases, but you need to write the pattern very carefully so that it does not turn out to be unexpectedly greedy. For complex document parsing, it is always better to use a module such as BeautifulSoup.
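To make the caveat concrete, here is a small sketch that runs the pattern from this answer on the string from the question. The names LINK_TEXT, raw_matches and texts are illustrative only.

```python
import re

# The (truncated) string from the question, copied as-is.
var = ('<li> <a href="/...html">Energy</a> <ul> '
       '<li> <a href="/...html">Coal</a> </li> '
       '<li> <a href="/...html">Oil </a> </li> '
       '<li> <a href="/...html">Carbon</a> </li> '
       '<li> <a href="/...html">Oxygen</a> </li')

# Text preceded by '>' and followed by '</a>', optional non-tag text, then '</li'.
LINK_TEXT = re.compile(r'(?<=>)([^<]+)(?=</a>[^<]*</li)', re.S)

raw_matches = LINK_TEXT.findall(var)
texts = [m.strip() for m in raw_matches]  # drop stray whitespace such as 'Oil '

print(raw_matches)
print(texts)
```

Energy is not matched here, because its closing </a> is followed by <ul> rather than </li, and 'Oil ' keeps its trailing space until it is stripped. Small differences in the markup change what the pattern finds, which is the usual argument for a parser.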

+2

woemler Jun 19 '13 at 1:49


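If you are only after what's inside the tags, try using XPath. For example, assuming lxml is installed (the original snippet did not name a library):

 from lxml import html
 tree = html.fromstring(var)  # parse the markup into an element tree
 for li in tree.xpath('.//ul/li'):
     text = li.xpath('.//a/text()')  # list of text nodes under each <a>
     print(text)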

You can also use BeautifulSoup for the parsing, with urllib to fetch the page.

+2

Ardy dedase Jun 19 '13 at 1:51


Davi sampaio · Accepted Answer · 2013-06-19T06:16:01+0000

You can use Beautiful Soup, which is very good for this kind of task. It is simple to use, easy to install, and has great documentation.

In your example, several li tags are not closed. I have corrected that, and here is how to get the text of every link inside the li tags:

 from bs4 import BeautifulSoup

 var = '''<li> <a href="/...html">Energy</a></li> <ul> <li><a href="/...html">Coal</a></li> <li><a href="/...html">Oil </a></li> <li><a href="/...html">Carbon</a></li> <li><a href="/...html">Oxygen</a></li>'''

 soup = BeautifulSoup(var, 'html.parser')  # name the parser explicitly
 for a in soup.find_all('a'):
     print(a.string)  # text inside each <a> tag

This prints:

Energy
Coal
Oil
Carbon
Oxygen

For documentation and more examples, see the Beautiful Soup documentation.
