DISCLAIMER: First, I agree that, in general, regex is not the best tool for parsing HTML. However, in the right hands (and with a few caveats), Philip Hazel's powerful (and most assuredly non-REGULAR) PCRE library (used by PHP's preg_*() family of functions) does allow solving non-trivial data-scraping problems such as this one (with some limitations and caveats - see below). The problem stated above is particularly hard to solve using regex alone, and regex solutions such as the one below are not for everyone and should never be attempted by a regex novice. Properly understanding the answer below requires a fairly deep grasp of several advanced regex constructs and techniques.
Won't someone think of the children! Yes, I've read bobince's legendary answer, and I know this is a touchy subject around here (to say the least). But please, if you are tempted to immediately click the down arrow because I '/(?:actual|brave|stupid)ly/' use the words REGEX and HTML in the same breath (and for a non-trivial problem, no less), I humbly ask you to refrain long enough to read this entire post and actually try this solution for yourself.
With that in mind, if you would like to see how an advanced regex can be crafted to solve this problem (for all but a few (unlikely) special cases - see below for examples), read on...
AN ADVANCED RECURSIVE REGEX SOLUTION: As Wes Hardaker correctly points out, DIVs can be (and frequently are) nested. However, he is not 100% correct when he says that you cannot construct a regex that will match up to the proper </div>. The truth is, with PHP, you can! (with some limitations - see below). Like Perl's engine (and, by way of balancing groups rather than recursion, .NET's), the PCRE regex engine in PHP can match nested structures: it provides recursive expressions (i.e. (?R), (?1), (?2), etc.) which match nesting to any arbitrary depth (limited only by memory). For example, you can easily match balanced nested parentheses with this expression: '/\((?:[^()]++|(?R))*+\)/'. Run this simple test if you have any doubts:
    $text = 'zero(one(two)one(two(three)two)one)zero';
    if (preg_match('/\((?:[^()]++|(?R))*+\)/', $text, $matches)) {
        print_r($matches);
    }
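For anyone who would rather read the outcome than run the test, here is a slightly expanded sketch of the same check. The helper name firstBalancedGroup() and the two extra inputs are illustrative additions, not part of the original test.

```php
<?php
// Hypothetical helper (an illustration, not part of the original answer):
// returns the first balanced parenthesised span in $text, or null if none.
function firstBalancedGroup(string $text): ?string
{
    // "(" ... ")" whose body is runs of non-parens or a recursive match
    // of the whole pattern via (?R).
    $re = '/\((?:[^()]++|(?R))*+\)/';
    return preg_match($re, $text, $m) === 1 ? $m[0] : null;
}

$nested     = firstBalancedGroup('zero(one(two)one(two(three)two)one)zero');
$unbalanced = firstBalancedGroup('(a(b)');          // outer "(" is never closed
$none       = firstBalancedGroup('no parens here');

var_dump($nested, $unbalanced, $none);
```

The first call should return the whole outer group, (one(two)one(two(three)two)one). For the unbalanced input, the unclosed outer parenthesis cannot match, so the engine moves on and settles for the inner (b).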
So, if we can all agree that a PHP regex can indeed match nested structures, let's move on to the problem at hand. This particular problem is complicated by the fact that the outermost DIV must have the id="content" attribute, but any nested DIVs may or may not. Thus, we can't use the (?R) recursively-match-the-whole-expression construct, because the subexpression that matches the outer DIV is not the same as the one needed to match the inner DIVs. Instead, we need a capture group (here, group 2) that serves as a "recursive subroutine" matching the inner, nested DIVs. So here is a tested PHP code snippet, sporting an advanced, not-for-the-faint-of-heart (but fully commented, so that you might actually be able to make some sense of it) regex, which correctly matches (in most cases - see below) a DIV having id="content", which may itself contain nested DIVs:
    $re = '% # Match a DIV element having id="content".
        <div\b             # Start of outer DIV start tag.
        [^>]*?             # Lazily match up to id attrib.
        \bid\s*+=\s*+      # id attribute name and =
        ([\'"]?+)          # $1: Optional quote delimiter.
        \bcontent\b        # specific ID to be matched.
        (?(1)\1)           # If open quote, match same closing quote
        [^>]*+>            # remaining outer DIV start tag.
        (                  # $2: DIV contents. (may be called recursively!)
          (?:              # Non-capture group for DIV contents alternatives.
          # DIV contents option 1: All non-DIV, non-comment stuff...
            [^<]++         # One or more non-tag, non-comment characters.
          # DIV contents option 2: Start of a non-DIV tag...
          | <              # Match a "<", but only if it
            (?!            # is not the beginning of either
              /?div\b      # a DIV start or end tag,
            | !--          # or an HTML comment.
            )              # Ok, that < was not a DIV or comment.
          # DIV contents option 3: an HTML comment.
          | <!--.*?-->     # A non-SGML compliant HTML comment.
          # DIV contents option 4: a nested DIV element!
          | <div\b[^>]*+>  # Inner DIV element start tag.
            (?2)           # Recurse group 2 as a nested subroutine.
            </div\s*>      # Inner DIV element end tag.
          )*+              # Zero or more of these contents alternatives.
        )                  # End $2: DIV contents.
        </div\s*>          # Outer DIV end tag.
        %isx';
    if (preg_match($re, $text, $matches)) {
        printf("Match found:\n%s\n", $matches[0]);
    }
As I said, this regex is quite complex, but rest assured, it does work! (with the exception of some unlikely cases noted below, and probably a few more that I would be very grateful if you could find). Try it out and see for yourself!
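The group-as-subroutine idea is easier to see in isolation on a smaller problem. The following is a minimal sketch, not the regex above: it assumes a made-up task of matching a call to a function literally named call whose argument list may contain nested parentheses. As with the DIV case, the outer opener differs from the inner ones, so (?R) would be the wrong tool and a numbered group is recursed instead.

```php
<?php
// Hypothetical reduced problem: only the OUTER opener must be "call(",
// while nested parentheses inside it are plain "(" ... ")".
$re = '%
    \bcall\(            # Outer opener: must be the literal name "call".
    (                   # $1: contents (reused recursively below).
      (?:
        [^()]++         # Option 1: a run of non-parenthesis characters.
      | \( (?1) \)      # Option 2: nested parens around the same contents.
      )*+               # Zero or more of these alternatives.
    )                   # End $1.
    \)                  # Outer closer.
    %x';

$text = 'x = other(1) + call(a, (b, (c)), d) + (e)';
if (preg_match($re, $text, $m)) {
    printf("Whole match: %s\nContents:    %s\n", $m[0], $m[1]);
}
```

The shape mirrors the DIV regex: a distinctive outer start, a contents group that can call itself through (?1), and a plain outer end.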
Should I use this? Would it be appropriate to use this regex solution in a production environment where hundreds or thousands of documents must be parsed with 100% reliability and accuracy? Of course not. Could it be useful for a limited one-time run over some HTML files (e.g. for the person who asked this question)? Possibly. It depends on how comfortable one is with advanced regexes. If the regex above looks like it was written in a foreign language (it is) and/or scares the daylights out of you, the answer is probably no.
Does it work? Yes. For example, given the following test data, the regex above correctly picks out the DIV having id="content" (or id='content' or id=content, for that matter):
    <!DOCTYPE HTML SYSTEM>
    <html>
    <head><title>Test Page</title></head>
    <body>
    <div id="non-content-div">
        <h1>PCRE does recursion!</h1>
        <div id='content'>
            <h2>First level matched</h2>
            <div id="one-deep">
                <h3>Second level matched</h3>
                <div id=two-deep>
                    <h4>Third level matched</h4>
                    <div id=three-deep>
                        <h4>Fourth level matched</h4>
                    </div>
                    <p>stuff</p>
                </div>
                <p>stuff</p>
            </div>
            <p>stuff</p>
        </div>
        <p>stuff</p>
    </div>
    <p>stuff</p>
    </body></html>
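The three quoting styles (id="content", id='content', id=content) can also be checked in isolation. Here is a minimal sketch that uses only the start-tag portion of the regex. The sample tags and the small harness around them are illustrative assumptions, not part of the tested snippet.

```php
<?php
// Start-tag portion of the regex only; everything else here is illustrative.
$startTag = '%
    <div\b             # Start of DIV start tag.
    [^>]*?             # Lazily match up to id attrib.
    \bid\s*+=\s*+      # id attribute name and =
    ([\'"]?+)          # $1: Optional quote delimiter.
    \bcontent\b        # specific ID to be matched.
    (?(1)\1)           # If open quote, match same closing quote
    [^>]*+>            # remaining DIV start tag.
    %ix';

$samples = [
    '<div id="content">',
    "<div id='content'>",
    '<div id=content>',
    '<DIV class="x" ID = "content" title="y">',
    '<div id="non-content-div">',   // different id: should not match
    '<div id="content\'>',          // mismatched quotes: should not match
];

$results = [];
foreach ($samples as $tag) {
    $results[$tag] = preg_match($startTag, $tag) === 1;
    printf("%-45s %s\n", $tag, $results[$tag] ? 'match' : 'no match');
}
```

Because the quote group is possessive, an opening quote cannot be given back on backtracking, so a tag that opens with one quote character and closes with the other is rejected.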
CAVEATS: So what are some scenarios where this solution does not work? Well, DIV start tags may NOT have any angle brackets in any of their attributes (this can be fixed, but it adds quite a bit more code). And the following CDATA spans, which contain the specific DIV start tag we are looking for (extremely unlikely), will cause the regex to fail:
    <style type="text/css">
    p:before {
        content: 'Unlikely CSS string with <div id=content> in it.';
    }
    </style>
    <p title="Unlikely attribute with a <div id=content> in it">stuff</p>
    <script type="text/javascript">
        alert("evil script with <div id=content> in it">");
    </script>
    <![CDATA[ a CDATA section with <div id="content"> in it ]]>
I would very much like to know of any others.
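On the first caveat, one possible way to let a start tag carry a ">" inside an attribute value is to match quoted values as whole units. This is only a sketch of the start-tag piece, on the assumption that any value containing ">" is quoted. It has not been folded into the full regex above, and the pattern names are illustrative.

```php
<?php
// Assumption: attribute values containing ">" are always quoted.
// $naive stops at the first ">"; $tolerant skips over quoted values first.
$naive    = '~<div\b[^>]*+>~i';
$tolerant = '~<div\b(?:[^>"\']++|"[^"]*+"|\'[^\']*+\')*+>~i';

$html = '<div title="a > b" id=content>text</div>';

preg_match($naive, $html, $n);
preg_match($tolerant, $html, $t);

printf("naive:    %s\n", $n[0]);
printf("tolerant: %s\n", $t[0]);
```

The naive pattern ends the tag at the ">" inside the title value, while the tolerant one reaches the real end of the start tag.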
GO READ MRE3: As I said before, truly grasping what is going on here requires a pretty deep understanding of several advanced techniques. These techniques are not obvious or intuitive. There is only one way I know of to acquire these skills, and that is to sit down and study Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3). (You will be glad you did!)
I can honestly say that this is the most useful book I've read in my entire life!
Cheers!
EDIT 2013-04-30: Fixed regex. It previously disallowed a non-DIV tag which immediately followed the DIV start tag.