Using a regular expression to match a div block with a specific identifier

Question


I am trying to match a div block that has a specific identifier. Here is my regex code:

<div\s+[^>]*\s*id\s*=\s*["|']content["|']\s*>[^/div]+

I want the regex to match the entire div block. So I put [^/div]+ in my regex, assuming it would match the remaining characters until it reached the closing tag. But it does not match to the end, because [^...] is a character class: it refuses to match any single one of the characters /, d, i, or v. I want them to be treated as a whole. Grouping with [^()] does not help either.

So please tell me how I should write this regex.

 <div id="content"> <noscript></noscript> <a href="blabla.com"> <h1> <a href="blablac.com">Blablabla</a> </h1> </div>

+4

html php regex

Kevin lee Mar 18 '11 at 17:08


5 answers

[^/div]+ will stop when it reaches any one of those characters, which is not what you want. With your sample it stops inside noscript, because of the i.
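To see where the character class gives up, here is a minimal sketch that runs the question's pattern against the question's sample markup. Only the `~` delimiters and the variable names are additions for illustration.

```php
<?php
// Sample markup taken from the question.
$html = '<div id="content"> <noscript></noscript> <a href="blabla.com"> <h1> <a href="blablac.com">Blablabla</a> </h1> </div>';

// The pattern from the question, unchanged apart from the ~ delimiters.
$pattern = '~<div\s+[^>]*\s*id\s*=\s*["|\']content["|\']\s*>[^/div]+~';

preg_match($pattern, $html, $m);

// [^/div]+ consumes " <noscr" and stops at the "i" of noscript.
echo $m[0], "\n";
```

The match ends long before the closing tag, which is why a negated character class cannot stand in for "anything except the string /div".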

Unfortunately, you cannot do what you want without knowing the internal structure of HTML. Consider this:

 <div id="content"> <div id="somethingelse"> </div> </div>

Even if you could write a regular expression that matches up to a </div>, you cannot write one that matches up to the correct </div>. You need much more involved parsing.
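A small sketch of the problem, assuming the nested sample above is followed by a sibling div (the `footer` div is hypothetical, added only to show the greedy case):

```php
<?php
// The nested sample from above, plus a hypothetical sibling div.
$html = '<div id="content"> <div id="somethingelse"> </div> </div> <div id="footer"> </div>';

// Lazy quantifier: stops at the first </div>, which closes the inner div.
preg_match('~<div id="content">.*?</div>~s', $html, $lazy);

// Greedy quantifier: runs on to the last </div>, which closes the sibling.
preg_match('~<div id="content">.*</div>~s', $html, $greedy);

echo $lazy[0], "\n";   // too short: the outer div is never closed
echo $greedy[0], "\n"; // too long: swallows the sibling div as well
```

Neither quantifier counts how many divs are currently open, so neither can find the closing tag that actually belongs to the outer div.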

+3

Wes hardaker Mar 18 '11 at 17:12


Use a parser rather than a regular expression.

Here is a PHP example: http://htmlparsing.com/php.html
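As a rough illustration of the parser route (not necessarily the approach shown on the linked page), the sketch below uses PHP's bundled DOM extension. The sample markup and variable names are assumptions made for the example.

```php
<?php
// Hypothetical input with a nested div inside the target div.
$html = '<div id="content"> <div id="somethingelse"> <p>inner</p> </div> </div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Select the div by its id attribute; nesting is handled by the parser.
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="content"]')->item(0);

if ($div !== null) {
    // Serialize the selected element together with everything inside it.
    echo $doc->saveHTML($div), "\n";
}
```

Because the parser builds a tree first, nested divs need no special handling: the selected node already contains all of its descendants.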

0

Andy lester Mar 18 '11 at 18:41


This article is awesome and the perfect solution for my needs!

It even works on HTML where SimpleXML or DOMDocument fail.

Sometimes you need to parse HTML generated by a third party, over which you have no control and which does not respect the DTD. That is where recursive regular expressions come in.

I just made a few modifications to your code and used it with the PHP function preg_match_all.

In the following example, we will try to correctly match the contents of div#content:

    $content = <<<HTML
    <div id="content">
        <!-- tutu -->
        <div id="something">
            <div id="somethingElse">
                <ul>
                    <li>lorem 1</li>
                    <li class="dfg" toto="titi">lorem 2</li>
                    <li class="dfg">lorem 3</li>
                    <li class="dfg">lorem 4</li>
                    <li class="dfg">lorem 5</li>
                    <li class="dfg">lorem 6</li>
                </ul>
                <br />
                <div id="emptyStuff"></div>
            </div>
        </div>
        <table>
            <tr>
                <td>cell 1</td> <td>cell 2</td> <td>cell 3</td>
                <td>cell 4</td> <td>cell 5</td> <td>cell 6</td>
            </tr>
            <tr>
                <td>cell 1</td> <td>cell 2</td> <td>cell 3</td>
                <td>cell 4</td> <td>cell 5</td> <td>cell 6</td>
            </tr>
        </table>
    </div>
    HTML;

    $pattern = '@# match nested tag
    (?(DEFINE)
        (?<comment>  <!--.*?-->)
        (?<cdata>    <!\[CDATA\[.*?\]\]>)
        (?<empty>    <\w+[^>]*?/>)
        (?<inline>   <(script|style)[^>]+>.*?</\g{-1}>)
        (?<nested>   <(\w+)[^>]*(?<!/)>(?&innerHTML)</\g{-1}>)
        (?<unclosed> <\w+[^>]*(?<!/)>)
        (?<text>     [^<]+)
    )
    (?<outerHTML><(?<tagName>div)\s?(?<attributes>[^>]*?id\h*=\h*(?<quote>"|\')[^(?&quote)\v>]*\bcontent\b[^(?&quote)\v>]*(?&quote)[^>]*)> # opening tag
        (?<innerHTML>
            (?:
                (?&comment) | (?&cdata) | (?&empty) | (?&inline) | (?&nested) | (?&unclosed) | (?&text)
            )*
        )
    </(?&tagName)>) # closing tag
    @six';

    preg_match_all($pattern, $content, $matches);

    // Show only the named groups of interest.
    var_dump(array_intersect_key($matches, array(
        'tagName'    => 1,
        'attributes' => 1,
        'innerHTML'  => 1,
        'outerHTML'  => 1
    )));

Here is the output:

    array(4) {
      ["outerHTML"]=>
      array(1) {
        [0]=>
        string(639) "<div id="content"> <!-- tutu --> <div id="something"> <div id="somethingElse"> <ul> <li>lorem 1</li> <li class="dfg" toto="titi">lorem 2</li> <li class="dfg">lorem 3</li> <li class="dfg">lorem 4</li> <li class="dfg">lorem 5</li> <li class="dfg">lorem 6</li> </ul> <br /> <div id="emptyStuff"></div> </div> </div> <table> <tr> <td>cell 1</td> <td>cell 2</td> <td>cell 3</td> <td>cell 4</td> <td>cell 5</td> <td>cell 6</td> </tr> <tr> <td>cell 1</td> <td>cell 2</td> <td>cell 3</td> <td>cell 4</td> <td>cell 5</td> <td>cell 6</td> </tr> </table> </div>"
      }
      ["tagName"]=>
      array(1) {
        [0]=>
        string(3) "div"
      }
      ["attributes"]=>
      array(1) {
        [0]=>
        string(12) "id="content""
      }
      ["innerHTML"]=>
      array(1) {
        [0]=>
        string(615) " <!-- tutu --> <div id="something"> <div id="somethingElse"> <ul> <li>lorem 1</li> <li class="dfg" toto="titi">lorem 2</li> <li class="dfg">lorem 3</li> <li class="dfg">lorem 4</li> <li class="dfg">lorem 5</li> <li class="dfg">lorem 6</li> </ul> <br /> <div id="emptyStuff"></div> </div> </div> <table> <tr> <td>cell 1</td> <td>cell 2</td> <td>cell 3</td> <td>cell 4</td> <td>cell 5</td> <td>cell 6</td> </tr> <tr> <td>cell 1</td> <td>cell 2</td> <td>cell 3</td> <td>cell 4</td> <td>cell 5</td> <td>cell 6</td> </tr> </table> "
      }
    }

Hope this helps!

0

shaft Aug 20 '12 at 15:10


 <div id=content>.*?</div>

is what you need, as long as you don't have nested divs. If you do, give up and use a real XML parser.

Turn on the dotall option so that the dot also matches line breaks (see http://www.regular-expressions.info/dot.html to find out how to do this in your regex flavor).

The minor details are left to you. :-)
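In PHP's preg functions, the dotall option is the `s` pattern modifier. The sketch below, using a made-up three-line input, shows the difference it makes:

```php
<?php
// Hypothetical input where the div spans several lines.
$html = "<div id=content>\n  <p>text</p>\n</div>";

// Without the s modifier, the dot does not match "\n", so the match fails.
$withoutDotall = preg_match('~<div id=content>.*?</div>~', $html);

// With the s modifier, the dot matches line breaks too.
$withDotall = preg_match('~<div id=content>.*?</div>~s', $html);

var_dump($withoutDotall); // int(0)
var_dump($withDotall);    // int(1)
```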

-1

Mauro vanetti Mar 18 '11 at 17:27


ridgerunner · Accepted Answer · 2011-03-19T00:49:21+0000

DISCLAIMER: First, I agree that, in general, regex is not the best tool for parsing HTML. However, in the right hands (and with a few caveats), Philip Hazel's powerful (and most decidedly non-REGULAR) PCRE library (used by PHP's preg_*() family of functions) does allow you to solve non-trivial data-scraping problems such as this one (with some limitations and caveats; see below). The problem stated above is particularly hard to solve using regex alone, and regex solutions such as the one below are not for everyone and should never be attempted by a regex novice. Properly understanding the answer below requires a fairly deep understanding of several advanced regex constructs and techniques.

Won't somebody please think of the children! Yes, I have read bobince's legendary answer, and I know this is a touchy subject around here (to say the least). But please, if you are tempted to reach for the down-vote arrow right away because I '/(?:actual|brave|stupid)ly/' use the words REGEX and HTML in the same breath (and for a non-trivial problem, no less), I humbly ask you to hold off long enough to read this entire post and actually try the solution for yourself.

With that in mind, if you want to see how an advanced regex can be crafted to solve this problem (for all but a few (unlikely) special cases; see the examples below), read on...

AN ADVANCED RECURSIVE REGEX SOLUTION: As Wes Hardaker correctly points out, DIVs can be (and often are) nested. However, he is not 100% right when he says that you cannot write a regex that matches up to the correct </div>. The truth is, with PHP, you can! (With some limitations; see below.) Like Perl and .NET, the PCRE regex engine in PHP can match nested structures to any arbitrary depth (limited only by memory); it does so with recursive expressions (i.e. (?R) , (?1) , (?2) , etc.). For example, you can easily match balanced nested parentheses with this expression: '/\((?:[^()]++|(?R))*+\)/' . Run this simple test if you have any doubts:

    $text = 'zero(one(two)one(two(three)two)one)zero';
    if (preg_match('/\((?:[^()]++|(?R))*+\)/', $text, $matches)) {
        print_r($matches);
    }

So, if we can all agree that a PHP regex can indeed match nested structures, let's move on to the problem at hand. This particular problem is complicated by the fact that the outermost DIV must have the id="content" attribute, while any nested DIVs may or may not have it. Thus, we cannot use the (?R) recursively-match-the-whole-expression construct, because the subexpression that matches the outer DIV is not the same as the one needed to match the inner DIVs. Instead, we need a capture group (here, group 2) that serves as a "recursive subroutine" matching the inner, nested DIVs. So here is a tested PHP snippet with an advanced, not for the faint of heart, but fully commented regex (so that you might actually be able to make sense of it), which correctly matches (in most cases; see below) a DIV having id="content" , which may itself contain nested DIVs:

    // $text holds the HTML document to search.
    $re = '% # Match a DIV element having id="content".
        <div\b             # Start of outer DIV start tag.
        [^>]*?             # Lazily match up to id attrib.
        \bid\s*+=\s*+      # id attribute name and =
        ([\'"]?+)          # $1: Optional quote delimiter.
        \bcontent\b        # specific ID to be matched.
        (?(1)\1)           # If open quote, match same closing quote
        [^>]*+>            # remaining outer DIV start tag.
        (                  # $2: DIV contents. (may be called recursively!)
          (?:              # Non-capture group for DIV contents alternatives.
          # DIV contents option 1: All non-DIV, non-comment stuff...
            [^<]++         # One or more non-tag, non-comment characters.
          # DIV contents option 2: Start of a non-DIV tag...
          | <              # Match a "<", but only if it
            (?!            # is not the beginning of either
              /?div\b      # a DIV start or end tag,
            | !--          # or an HTML comment.
            )              # Ok, that < was not a DIV or comment.
          # DIV contents Option 3: an HTML comment.
          | <!--.*?-->     # A non-SGML compliant HTML comment.
          # DIV contents Option 4: a nested DIV element!
          | <div\b[^>]*+>  # Inner DIV element start tag.
            (?2)           # Recurse group 2 as a nested subroutine.
            </div\s*>      # Inner DIV element end tag.
          )*+              # Zero or more of these contents alternatives.
        )                  # End $2: DIV contents.
        </div\s*>          # Outer DIV end tag.
        %isx';

    if (preg_match($re, $text, $matches)) {
        printf("Match found:\n%s\n", $matches[0]);
    }

As I said, this regex is quite complex, but rest assured, it does work, except for some unlikely cases noted below (and probably a few more, which I would be very grateful to hear about if you find them). Try it out and see for yourself!
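To look at the group-as-subroutine idea in isolation, here is a stripped-down sketch on parentheses instead of DIVs. The input string and the `f` prefix are hypothetical; the point is only that the prefix is matched once, while `(?1)` re-enters group 1 alone.

```php
<?php
// Hypothetical input: only the call to f() should be matched.
$text = 'g(1) f(a(b)c(d(e))) h(2)';

// \bf is matched once, outside the recursion.
// Group 1 is one balanced parenthesised span; (?1) recurses into that group only.
$re = '/\bf(\((?:[^()]++|(?1))*+\))/';

if (preg_match($re, $text, $m)) {
    echo $m[0], "\n"; // the whole match, including the f prefix
    echo $m[1], "\n"; // group 1: the balanced argument list
}
```

With `(?R)` in place of `(?1)`, every nested parenthesis would also have to be preceded by `f`, which is the same mismatch as requiring id="content" on every inner DIV.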

Should I use this? Would it be appropriate to use this regex solution in a production environment where hundreds or thousands of documents must be parsed with 100% reliability and accuracy? Of course not. Could it be useful for a limited one-time run over some HTML files (for example, for the person who asked this question)? Possibly. It depends on how comfortable you are with advanced regexes. If the regex above looks like it was written in a foreign language (it is) and/or scares the dickens out of you, the answer is probably no.

Does it work? Yes. For example, given the following test data, the regex above correctly picks out the DIV having id="content" (or id='content' or id=content ):

    <!DOCTYPE HTML SYSTEM>
    <html>
    <head><title>Test Page</title></head>
    <body>
    <div id="non-content-div">
        <h1>PCRE does recursion!</h1>
        <div id='content'>
            <h2>First level matched</h2>
            <!-- this comment </div> is tricky -->
            <div id="one-deep">
                <h3>Second level matched</h3>
                <div id=two-deep>
                    <h4>Third level matched</h4>
                    <div id=three-deep>
                        <h4>Fourth level matched</h4>
                    </div>
                    <p>stuff</p>
                </div>
                <!-- this comment <div> is tricky -->
                <p>stuff</p>
            </div>
            <p>stuff</p>
        </div>
        <p>stuff</p>
    </div>
    <p>stuff</p>
    </body></html>

CAVEATS: So, what are some scenarios where this solution does not work? Well, DIV start tags may not have any angle brackets in any of their attributes (this can be fixed, but it adds quite a bit more code). And the following CDATA spans, which contain the specific DIV start tag we are looking for (extremely unlikely), will cause the regex to fail:

    <style type="text/css">
    p:before {
        content: 'Unlikely CSS string with <div id=content> in it.';
    }
    </style>
    <p title="Unlikely attribute with a <div id=content> in it">stuff</p>
    <script type="text/javascript">
        alert("evil script with <div id=content> in it">");
    </script>
    <!-- Comment with <div id="content"> in it -->
    <![CDATA[ a CDATA section with <div id="content"> in it ]]>

I would very much like to hear about any others.

GO READ MRE3. As I said before, truly understanding what is going on here requires a fairly deep understanding of several advanced techniques. These techniques are not obvious or intuitive. There is only one way I know of to gain these skills, and that is to sit down and study: Mastering Regular Expressions (3rd Edition) by Jeffrey Friedl (MRE3). (You will be glad you did!)

I can honestly say that this is the most useful book I've read in my entire life!

Cheers!

EDIT 2013-04-30: Fixed the regex. It previously rejected a non-DIV tag that immediately followed the DIV start tag.
