Extract Regular Expression Part

Question

I want a regular expression to extract the title from an HTML page. I currently have this:

 title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
 if title:
     title = title.replace('<title>', '').replace('</title>', '')

Is there a regex that extracts only the contents of <title>, so I don't need to remove the tags afterwards?

+86

python html regex html-content-extraction

hoju Aug 25 '09 at 10:24

10 answers

Please do not use regex to parse markup languages. Use lxml or beautifulsoup.

+34

iElectric Aug 25 '09 at 10:31

Try using capture groups:

 title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

+6

Aaron Maenpaa Aug 25 '09 at 10:30

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

+3

Vinay Sajip Aug 25 '09 at 10:28

Try:

 title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

+2

Randy Aug 25 '09 at 10:28

Using regular expressions for HTML parsing is usually not a good idea. You can use any HTML parser for this, such as Beautiful Soup. Check out http://www.crummy.com/software/BeautifulSoup/documentation.html

Also remember that some people who encounter a problem think, "I know, I will use regular expressions." Now they have two problems.

+2

Vihang D Aug 25 '09 at 10:35

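May I recommend Beautiful Soup? It is a very good library for parsing an entire HTML document.

 from bs4 import BeautifulSoup

 soup = BeautifulSoup(html_doc, 'html.parser')
 titleName = soup.title.string  # .string is the title text; .name would only give the tag name 'title'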

+2

kharagpur Mar 01 '13 at 19:22

The provided code snippets do not handle the exception raised when the pattern is not found. May I suggest:

 getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first captured group if the pattern is found, and the default empty string if it is not.
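If the getattr one-liner is hard to read, the same default-on-miss behavior can be sketched as a small function. The names TITLE_RE and extract_title and the default parameter are hypothetical, chosen only for illustration:

```python
import re

# Hypothetical helper; the names and the `default` parameter are illustrative only.
TITLE_RE = re.compile(r"<title>(.*)</title>", re.IGNORECASE)


def extract_title(html, default=""):
    """Return the text captured between <title> tags, or `default` if there is no match."""
    match = TITLE_RE.search(html)
    # re.search returns None on a miss, so check before calling group().
    return match.group(1) if match else default


print(extract_title("<html><TITLE>Hello</TITLE></html>"))  # Hello
print(repr(extract_title("<html></html>")))                # ''
```

Compiling the pattern once is optional here; it simply keeps the flags and the pattern in one place.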

+2

Steve K Oct 27 '13 at 14:07

I think this should be enough:

 import re
 pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
 pattern.search(text)

... assuming your text (HTML) is in a variable called "text".

This also assumes that no other HTML tags can legally be embedded inside the HTML TITLE element, and that no other < character can legally appear inside such a container/block.
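A short sketch of how these assumptions play out, comparing the greedy (.*) pattern used elsewhere in this thread with the ([^<]*) pattern. The sample strings are made up for illustration:

```python
import re

# Made-up inputs for illustration.
one_line = "<title>First</title><p>text</p><title>Second</title>"
multi_line = "<title>\nSpread over lines\n</title>"

greedy = re.compile(r"<title>(.*)</title>", re.IGNORECASE)
no_angle = re.compile(r"<title>([^<]*)</title>", re.IGNORECASE)

# Greedy .* runs to the LAST </title> on the line, swallowing markup in between.
print(greedy.search(one_line).group(1))    # First</title><p>text</p><title>Second

# [^<]* stops at the first '<', so only the first title text is captured.
print(no_angle.search(one_line).group(1))  # First

# By default '.' does not match a newline, so the greedy pattern finds nothing.
print(greedy.search(multi_line))           # None

# [^<] does match newlines, so a title spread over several lines is still found.
print(repr(no_angle.search(multi_line).group(1)))  # '\nSpread over lines\n'
```

Note that re.MULTILINE only changes how ^ and $ behave, so it has no effect on either pattern; re.DOTALL is the flag that lets . match newlines.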

However ...

Do not use regular expressions to parse HTML in Python. Use an HTML parser! (Unless you are going to write a complete parser, which would be extra work when various HTML, SGML, and XML parsers are already in the standard libraries.)
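As a rough sketch of the standard-library route, here is one way to pull out the title with html.parser.HTMLParser from Python 3. The class name TitleParser and the function get_title are hypothetical, and this is only one of several ways to structure it:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None means no complete <title>...</title> pair was seen.
        return "".join(self._parts).strip() if self._done else None


def get_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(get_title("<html><head><TITLE>Hello</TITLE></head></html>"))  # Hello
print(get_title("<p>no title here</p>"))                            # None
```

Unlike the regex versions, this does not care about tag case, attributes on the title tag, or line breaks inside the title.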

If you are handling "real world" tag soup HTML (which frequently fails any SGML/XML validator), use BeautifulSoup. It is not in the standard libraries (yet), but it is widely recommended for this purpose.

Another option: lxml ..., which is written for properly structured (standards-compliant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.

+1

Jim Dennis Aug 25 '09 at 10:35

Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), you can slightly improve on Krzysztof Krasoń's solution by capturing the match result directly in the if condition as a variable and reusing it in the body of the condition:

 # pattern = '<title>(.*)</title>'
 # text = '<title>hello</title>'
 if match := re.search(pattern, text, re.IGNORECASE):
     title = match.group(1)
 # hello

0

Xavier Guihot Apr 27 '19 at 15:06

Krzysztof Krasoń · Accepted Answer · 2009-08-25 10:29

Use ( ) in the regexp and group(1) in Python to retrieve the captured string (re.search will return None if it doesn't find a match, so don't call group() directly):

 title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
 if title_search:
     title = title_search.group(1)
