I think this should be enough:
    import re
    pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
    pattern.search(text)
... assuming your text (HTML) is in a variable called "text".
This also assumes that no other HTML tags can legally be embedded inside an HTML TITLE tag, and that there is no legal way to embed any other < character inside such a container / block.
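As a rough sketch of how the result might be used (the sample strings and the names `title` and `with_attr` are made up for illustration), the captured text is in group 1 of the match object, and `search()` returns `None` when nothing matches:

```python
import re

# Same pattern as in the answer; MULTILINE is omitted here because the
# pattern contains no ^ or $ anchors, so it would have no effect.
pattern = re.compile(r'<title>([^<]*)</title>', re.IGNORECASE)

# Hypothetical sample document, not taken from the question.
text = "<html><head><TITLE>Example page</TITLE></head><body></body></html>"

match = pattern.search(text)
# Guard against None before calling group().
title = match.group(1).strip() if match else None
print(title)

# A title tag that carries an attribute is not matched by this pattern.
with_attr = '<title lang="en">Example page</title>'
print(pattern.search(with_attr))
```

The second case shows one way the regular expression can miss a perfectly valid title.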
However ...
Do not use regular expressions to parse HTML in Python. Use an HTML parser! (Unless you are going to write a complete parser, which would be a lot of extra work when various HTML, SGML, and XML parsers are already in the standard libraries.)
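For illustration only, here is a minimal sketch using one of those standard-library parsers, assuming Python 3, where the module is `html.parser`. The names `TitleParser` and `extract_title` and the sample string are hypothetical, not something from the question:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lower case.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect them all.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks).strip() if self._done else None


def extract_title(html_text):
    """Return the title text, or None if no complete <title> was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Hypothetical sample; note the upper-case tag and the attribute.
sample = '<html><head><TITLE lang="en">Example page</TITLE></head></html>'
print(extract_title(sample))
```

Unlike the regular expression, this sketch does not care about tag case or attributes on the title tag.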
If you are handling "real world" tag soup HTML (which often fails any SGML / XML validator), use BeautifulSoup. It is not in the standard libraries (yet), but it is widely recommended for this purpose.
Another option is lxml, which is written for properly structured (standards-compliant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
Jim Dennis Aug 25 '09 at 10:35