Python regex for multiple tags

Question

I would like to know how to get the contents of every <p> tag, not just the first one.

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match(r'<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())

result:

('item1', )

what I need:

('item1', 'item2', 'item3')

+2

python html regex

Felipe andrade Jun 09 '09 at 22:09

5 answers

For this type of problem, it is recommended that you use a DOM parser rather than a regular expression.

I've seen Beautiful Soup frequently recommended for Python.
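
As a rough illustration of the parser-based approach, here is a minimal sketch that uses only the standard library's html.parser module instead of Beautiful Soup. The class name ParagraphCollector and its attributes are made up for this example, and it assumes <p> tags are not nested.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []     # text of each finished <p>
        self._in_p = False  # True while between <p> and </p>
        self._parts = []    # text fragments of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_p:
            self.items.append(''.join(self._parts))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)


html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
collector = ParagraphCollector()
collector.feed(html_text)
collector.close()
print(collector.items)  # ['item1', 'item2', 'item3']
```

Because the parser handles the attributes itself, their order, spacing, and values do not affect the result.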

+11

Peter Boughton Jun 09 '09 at 10:14

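# Written for Python 2 and BeautifulSoup 3. On Python 3, the equivalents are
# `from bs4 import BeautifulSoup` and `from urllib.request import urlopen`.
from BeautifulSoup import BeautifulSoup
import urllib2

def getTags(tag):
    # Download the page and parse it into a tree.
    f = urllib2.urlopen("http://cnn.com")
    soup = BeautifulSoup(f.read())
    # Return every element with the given tag name.
    return soup.findAll(tag)


if __name__ == '__main__':
    tags = getTags('p')
    for tag in tags:
        print(tag.contents)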

This prints the contents of every p tag on the page.

+5

Brett Bim Jun 09 '09 at 23:00

You can use re.findall as follows:

import re
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.findall(r'<p[^>]*size="[0-9]">(.*?)</p>', html))
# This prints: ['item1', 'item2', 'item3']

Edit: ... but, as many commenters have pointed out, using regular expressions to parse HTML is usually a bad idea.
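
For context on why the question's code returned only one item: re.match only tries to match at the start of the string and returns a single match object, while re.findall and re.finditer scan the whole string for every non-overlapping match. A small sketch of the difference, using the pattern and sample string from the question (the variable names are just for illustration):

```python
import re

html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# match() anchors at position 0 and stops after the first match.
first = pattern.match(html)
print(first.groups())  # ('item1',)

# findall() returns the captured group from every match in the string.
all_items = pattern.findall(html)
print(all_items)  # ['item1', 'item2', 'item3']

# finditer() yields match objects, useful when positions are needed too.
from_iter = [m.group(1) for m in pattern.finditer(html)]
print(from_iter)  # ['item1', 'item2', 'item3']
```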

+2

Richiehindle Jun 09 '09 at 10:12

Alternatively, xml.dom.minidom will parse your HTML if

... it is well-formed
... you wrap it in a single root element.

E.g.:

>>> import xml.dom.minidom
>>> htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
>>> d = xml.dom.minidom.parseString('<not_p>%s</not_p>' % htmlText)
>>> tuple(map(lambda e: e.firstChild.wholeText, d.firstChild.childNodes))
('item1', 'item2', 'item3')
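
A slightly more defensive variant of the same idea, as a sketch: it looks the <p> elements up by name instead of relying on them being direct children, and gathers text from nested elements too. The wrapper name root and the helper text_of are placeholders invented for this example; the input still has to be well-formed XML.

```python
import xml.dom.minidom

html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Wrap the fragment in one root element so that it is a valid XML document.
doc = xml.dom.minidom.parseString('<root>%s</root>' % html_text)


def text_of(node):
    """Concatenate all text nodes found beneath `node`."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return ''.join(parts)


items = tuple(text_of(p) for p in doc.getElementsByTagName('p'))
print(items)  # ('item1', 'item2', 'item3')
```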

+2

Stephan202 Jun 09 '09 at 10:38

Triptych · Accepted Answer · 2009-06-10T03:19:07+0000

The regular expression answer is extremely fragile. Here's the proof (and a working example using BeautifulSoup).

# BeautifulSoup 3 import; with Beautiful Soup 4, use `from bs4 import BeautifulSoup`.
from BeautifulSoup import BeautifulSoup

# Here is your HTML.
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here is some simple HTML that breaks the regex
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html).findAll('p')
items = [''.join(p.findAll(text=True)) for p in paragraphs]

Use BeautifulSoup.
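
The breakage claimed in the comments above can be checked with the standard library alone. This sketch runs the regex from the re.findall answer over the four sample strings; the dictionary and variable names are only for illustration.

```python
import re

pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

samples = {
    # Original input: size is the last attribute, a single digit, no extra space.
    'html': '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # Attribute order swapped in the first tag.
    'html2': '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # A space before the closing ">" of the first tag.
    'html3': '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    # A two-digit size value in the first tag.
    'html4': '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}

results = {name: pattern.findall(text) for name, text in samples.items()}
for name, found in results.items():
    print(name, found)
# html  ['item1', 'item2', 'item3']
# html2 ['item2', 'item3']
# html3 ['item2', 'item3']
# html4 ['item2', 'item3']
```

In each of the three variants the first <p> is silently skipped, which is the kind of failure that is easy to miss in real data.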
