Python regex for multiple tags
I would like to know how to get the contents of every <p> tag, not just the first one.
import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match('<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())
Result:
('item1',)
What I need:
('item1', 'item2', 'item3')
The regular-expression answer is extremely fragile. Here is the proof (and a working example using BeautifulSoup).
# BeautifulSoup 3 (`from BeautifulSoup import BeautifulSoup`) is Python 2 only;
# bs4 is its successor package.
from bs4 import BeautifulSoup
# Here is your HTML.
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
# Here is some simple HTML that breaks the accepted
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag:
# html2 swaps the attribute order, html3 has a space before '>',
# and html4 has a two-digit size.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'
# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html, 'html.parser').find_all('p')
# get_text() joins all the text nodes inside each <p>.
items = [p.get_text() for p in paragraphs]
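To check the claim about the three variants without installing anything, the regex from the accepted answer can be run against each string. This is only a minimal sketch: the `samples` and `results` names are made up for illustration, and the pattern is copied unchanged from the regex answer.

```python
import re

# Pattern copied from the regex answer, unchanged.
PATTERN = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# The original HTML plus the three variants from this answer.
samples = {
    "html": '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html2": '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html3": '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    "html4": '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}

# Run the same pattern over every sample and keep what it finds.
results = {name: PATTERN.findall(text) for name, text in samples.items()}

for name, found in results.items():
    print(name, found)
```

Only the original string yields all three items; for html2, html3 and html4 the first <p> is silently dropped, which is the kind of failure that is easy to miss.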
Use BeautifulSoup.
For this type of problem, it is recommended that you use a DOM parser rather than a regular expression.
I have seen Beautiful Soup frequently recommended for Python.
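If installing a third-party package is not an option, the standard library's html.parser can do a similar job. Note that this is a different technique from Beautiful Soup: it is an event-based parser, not a DOM. The sketch below is an assumption about how one might use it here; `ParagraphCollector` and `paragraph_texts` are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None means we are not inside a <p>

    def handle_starttag(self, tag, attrs):
        # Attribute order, spacing and values do not matter here.
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.items.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        # Only keep text that appears while a <p> is open.
        if self._buffer is not None:
            self._buffer.append(data)


def paragraph_texts(html):
    """Return a list with the text of every <p> in the given HTML string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.items


html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(paragraph_texts(html4))
```

Because the parser tokenizes tags itself, variations such as a two-digit size or a space before the closing bracket should not affect the result.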
You can use re.findall as follows:
import re
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.findall('<p[^>]*size="[0-9]">(.*?)</p>', html))
# This prints: ['item1', 'item2', 'item3']
Edit: ... but, as many commenters have pointed out, using regular expressions to parse HTML is usually a bad idea.
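For anyone wondering why the question's code returned only one item: re.match tries the pattern only at the start of the string and returns a single match object, while re.findall scans the whole string. A small sketch of the difference, with the same caveat about fragility; the variable names `first` and `all_items` are just for illustration.

```python
import re

html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# match() anchors at position 0 and stops after the first hit,
# so groups() only ever holds the first paragraph.
first = pattern.match(html)
print(first.groups())

# findall() keeps scanning; with a single group it returns a list of
# that group's text, which can be turned into the tuple the question asks for.
all_items = tuple(pattern.findall(html))
print(all_items)
```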
Alternatively, xml.dom.minidom will parse your HTML if
- ... it is well-formed
- ... you embed it in a single root element.
E.g.
>>> import xml.dom.minidom
>>> htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
>>> d = xml.dom.minidom.parseString('<not_p>%s</not_p>' % htmlText)
>>> tuple(map(lambda e: e.firstChild.wholeText, d.firstChild.childNodes))
('item1', 'item2', 'item3')
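The "well-formed" condition matters in practice, since a lot of real HTML is not valid XML. Here is a hedged sketch of the same idea wrapped in a function that reports failure instead of raising; `paragraph_texts` and the `<root>` wrapper name are assumptions made for this example.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def paragraph_texts(fragment):
    """Return the direct text of each <p>, or None if the fragment is not well-formed XML."""
    try:
        # Wrap the fragment so the document has exactly one root element.
        doc = xml.dom.minidom.parseString('<root>%s</root>' % fragment)
    except ExpatError:
        return None
    return tuple(
        # Join only the text nodes that are direct children of the <p>.
        ''.join(node.data for node in p.childNodes if node.nodeType == node.TEXT_NODE)
        for p in doc.documentElement.getElementsByTagName('p')
    )


good = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
bad = '<p>item1<br></p>'  # <br> is never closed, so this is not well-formed XML

print(paragraph_texts(good))
print(paragraph_texts(bad))
```

Returning None for input that fails to parse is just one possible design; falling back to an HTML-aware parser would be another.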