Regex - matching tag names in HTML only

Question

How can I use regex to extract all html tag names in an HTML snippet? I use PHP for this, if that matters. For instance:

<div id="someid"> <img src="someurl" /> <br /> <p>some content</p> </div>

should return: div, img, br, p.

+1

html regex

VinnyD Aug 24 '11 at 20:11

3 answers

This should work for most well-formed markup, provided you are not inside a CDATA section and nobody has played nasty games with redefined entities:

 # nasty, ugly, illegible, unmaintainable: NEVER USE THIS STYLE!!!!
 /<\w+(?:\s+\w+=(?:\S+|(['"])(?:(?!\1).)*?\1))*\s*\/?>/s

or, more legibly:

 # broken out into related elements grouped by whitespace via /x
 / < \w+ (?: \s+ \w+ = (?: \S+ | (['"]) (?: (?! \1) . ) *? \1 )) * \s* \/? > /xs

and even more legibly:

 /
   # start of tag, with named ident
   < \w+
   # now with unlimited k=v pairs
   #   where k is \w+
   #   and v is either \S+ or else quoted
   (?: \s+ \w+ = (?:
           \S+             # either an unquoted value,
         | ( ['"] )        # or else first pick either quote
           (?: (?! \1) .   # anything that isn't our quote, including brackets
           ) * ?           # maximal should probably work here
           \1              # till we see it again
       )
   )  *     # as many k=v pairs as we can find
   \s *     # tolerate closing whitespace
   \/ ?     # XHTML style close tag
   >        # finally done
 /xs

There is a bit of slop you could add there, such as tolerating whitespace in a few places where I do not above.

PHP is not necessarily the best language for this kind of work, although it will do in a pinch. At the very least, you should hide this stuff in a function and/or a variable somewhere rather than leave it sitting out naked, considering that The Children Are Watching™.
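
As a rough illustration of hiding the pattern in a function, here is a minimal PHP sketch. The function name `extract_tag_names`, the named groups, and the decision to try the quoted alternative before `\S+` are my assumptions for this example, not part of the pattern above:

```php
<?php
// Hypothetical helper, not part of the original answer: keeps the pattern in
// one place and returns the names of opening and self-closing tags.
function extract_tag_names(string $html): array
{
    // Same shape as the pattern above. Assumptions added for this sketch:
    // named groups, and the quoted alternative is tried before \S+.
    $tag_rx = '~
        < (?<name> \w+ )                 # start of tag, capture its name
        (?: \s+ \w+ =                    # any number of k=v pairs
            (?: (?<quote> ["\'] )        # either a quoted value,
                (?: (?! \k<quote> ) . )*?
                \k<quote>
              | \S+                      # or an unquoted one
            )
        )*
        \s*                              # tolerate closing whitespace
        /?                               # XHTML style close
        >                                # finally done
    ~xs';

    preg_match_all($tag_rx, $html, $matches);

    // Drop repeated names and reindex from zero.
    return array_values(array_unique($matches['name']));
}

// Sample input taken from the question.
$html = '<div id="someid"> <img src="someurl" /> <br /> <p>some content</p> </div>';
echo implode(', ', extract_tag_names($html)), "\n";   // prints: div, img, br, p
```

Callers then only see the function name, and the pattern can be fixed or replaced in one place.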

For anything more complicated than finding, oh, I don't know, letters or whitespace, patterns benefit a lot from comments and whitespace. This should go without saying, but for some reason people forget to use /x for cognitive chunking, which lets you group related things with whitespace just as you would in imperative code.

Even though patterns are declarative programs rather than imperative ones, they benefit even more from full problem decomposition and top-down design. One way to get this is to use regex subroutines that you declare separately from where you use them. Otherwise you are just doing cut-and-paste code reuse, which is code reuse of the pessimal sort. Here is an example pattern for matching an <img> tag, this time using real Perl:

 my $img_rx = qr{
    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )
    # remainder is pure declaration
    (?(DEFINE)
        (?<image_tag>
            (?&start_tag)
            (?&might_white)
            (?&attributes)
            (?&might_white)
            (?&end_tag)
        )
        (?<attributes>
            (?:
                (?&might_white)
                (?&one_attribute)
            ) *
        )
        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white)
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )
        (?<legal_attribute>
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line
              | (?&illegal_attribute)
            )
        )
        (?<illegal_attribute> \b \w+ \b )
        (?<required_attribute>
            alt
          | src
        )
        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )
        # NB: The white space in string literals
        #     below DOES NOT COUNT!   It's just
        #     there for legibility.
        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )
        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )
        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )
        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )
        (?<unquoted_value>
            (?&unwhite_chunk)
        )
        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote>
        )
        (?<unwhite_chunk>
            (?:
                # (?! [<>'"] )
                (?! > )
                \S
            ) +
        )
        (?<might_white>     \s *   )
        (?<start_tag>
            < (?&might_white)
            img
            \b
        )
        (?<end_tag>
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )
        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )
    )
 }six;

Yes, it gets long, but by getting longer it becomes more maintainable, not less. It is also more correct. Now, the real program it is used in does more than just that, because real HTML makes you account for a good deal more, such as CDATA, encodings, and naughty redefinitions of entities. However, contrary to popular belief, you really can do this sort of thing in PHP, because it uses PCRE, which supports (?(DEFINE)...) blocks as well as recursive patterns. I have more serious examples of this kind in my answers here, here, here, here, and here.
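
To make the PCRE point concrete, here is a much smaller PHP sketch in the same declare-then-use style. It is only an illustration: the subpattern names are chosen for this example, it matches any opening or self-closing tag rather than just <img>, and it makes no attempt to handle CDATA, comments, or entities:

```php
<?php
// Sketch only: the subpattern names below are chosen for this example.
$tag_rx = '~
    # the one real capture: the tag name
    < (?<name> (?&tag_name) )
    (?&attributes)
    (?&might_white) /? >

    # remainder is pure declaration
    (?(DEFINE)
        (?<tag_name>       [A-Za-z] \w* )
        (?<attributes>     (?: \s+ (?&one_attribute) )* )
        (?<one_attribute>
            [\w:-]+
            (?: (?&might_white) = (?&might_white)
                (?: (?&quoted_value) | (?&unquoted_value) )
            )?
        )
        (?<quoted_value>   " [^"]* " | \' [^\']* \' )
        (?<unquoted_value> [^\s>"\']+ )
        (?<might_white>    \s* )
    )
~x';

// Sample input taken from the question.
$html = '<div id="someid"> <img src="someurl" /> <br /> <p>some content</p> </div>';

preg_match_all($tag_rx, $html, $m);
$tag_names = array_values(array_unique($m['name']));

echo implode(', ', $tag_names), "\n";   // prints: div, img, br, p
```

The name is captured outside the (?(DEFINE)...) block on purpose: groups set inside a subroutine call are not visible once the call returns, so the capture has to wrap the call itself.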

OK, did you read all of those, or at least look at them? Still with me? Hello?? Don't forget to breathe. There, there, you'll be all right now. :)

Certainly there is a big gray area where the possible gives way to the inadvisable, and, much sooner than you might expect, to the impossible. If the examples in those answers, let alone the one in this answer, are beyond your current skill level with pattern matching, then you should probably use something else, which often means getting somebody else to do it for you.

+3

tchrist Aug 24 '11 at 21:26

I think this should work ... I'll try in a minute:

edit: removed \s+ (thanks to Peteris)

 preg_match_all('/<(\w+)[^>]*>/', $html, $matched_elements);
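
For what it's worth, a minimal sketch of reading the result back, using the sample from the question. The tag names end up in capture group 1; the `$tag_names` variable and the de-duplication step are additions for this example:

```php
<?php
// Sample input taken from the question.
$html = '<div id="someid"> <img src="someurl" /> <br /> <p>some content</p> </div>';

// Group 1 captures the tag name; closing tags such as </p> are skipped
// because "/" is not a \w character.
preg_match_all('/<(\w+)[^>]*>/', $html, $matched_elements);

// array_unique drops repeated names, array_values reindexes the result.
$tag_names = array_values(array_unique($matched_elements[1]));

echo implode(', ', $tag_names), "\n";   // prints: div, img, br, p
```

Note that `[^>]*` stops at the first `>`, so an attribute value that itself contains `>` will cut the match short, although the captured name is still correct in that case.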

+1

Teneff Aug 24 '11 at 20:14

nickytonline · Accepted Answer · 2011-08-24T20:14:40+0000

Regular expressions may not always work. If you are 100% sure the input is well-formed XHTML, regular expressions may be a way to do this. If not, use an HTML parsing library for PHP instead. In C#, for example, there is the HTML Agility Pack, http://htmlagilitypack.codeplex.com; see "How to parse HTML using regular expressions in C#?". There may well be an equivalent tool for PHP.
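
One candidate on the PHP side is the bundled DOM extension. A minimal sketch, assuming that extension is enabled; it relies on `DOMDocument::loadHTML` wrapping a snippet in implied html and body elements, which the XPath query then skips:

```php
<?php
// Sample input taken from the question.
$html = '<div id="someid"> <img src="someurl" /> <br /> <p>some content</p> </div>';

libxml_use_internal_errors(true);      // collect parser warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);                 // the parser adds html and body around the snippet
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$names = [];
// Only descendants of body, so the implied wrapper elements are skipped.
foreach ($xpath->query('//body//*') as $element) {
    $names[$element->nodeName] = true; // keys keep each name once, in document order
}
$tag_names = array_keys($names);

echo implode(', ', $tag_names), "\n";
```

Because this goes through a real parser, it also copes with markup that is not well-formed. The trade-off is that a snippet containing its own html or body tags will not have those reported by this query.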
