This should work for most well-formed markup, provided you aren't in a CDATA section and haven't played nasty games redefining entities:
    # nasty, ugly, illegible, unmaintainable - NEVER USE THIS STYLE!!!!
    /<\w+(?:\s+\w+=(?:\S+|(['"])(?:(?!\1).)*?\1))*\s*\/?>/s
or more legibly as
    # broken up into related elements grouped by whitespace via /x
    / < \w+ (?: \s+ \w+ = (?: \S+ | (['"]) (?: (?! \1) . ) *? \1 )) * \s* \/? > /xs
and even more legibly this way:
/
There's a little slack you could add there, for example allowing whitespace in a few places where I don't above.
PHP isn't necessarily the best language for this kind of work, although you can make do with it in a pinch. At the very least, you should hide this stuff in a function and/or variable somewhere, not leave it lying around in the open all naked, considering The Children Are Watching™.
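Here is roughly what I mean by hiding it, as a minimal Perl sketch. The names `$OPEN_TAG_RX` and `find_open_tags` are made up for illustration, and the pattern is just the /x version from above, with all of its limitations:

```perl
use strict;
use warnings;

# Compile the pattern once and keep it in one named place.
my $OPEN_TAG_RX = qr/ < \w+ (?: \s+ \w+ = (?: \S+ | (['"]) (?: (?! \1) . ) *? \1 )) * \s* \/? > /xs;

# Return every substring of $html that looks like an opening tag.
# Uses $& rather than wrapping the pattern in an extra capture group,
# because an extra group would renumber the \1 backreference.
sub find_open_tags {
    my ($html) = @_;
    my @tags;
    while ($html =~ /$OPEN_TAG_RX/g) {
        push @tags, $&;
    }
    return @tags;
}

print "$_\n" for find_open_tags(q{<p class="x">Hi <br/> there</p>});
```

Callers then only ever see `find_open_tags`, so if the pattern has to change, it changes in exactly one place.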
For doing anything more complicated than finding, oh, I don't know, letters or whitespace, patterns benefit greatly from comments and whitespace. This should go without saying, but for some reason people forget to use /x for cognitive chunking, letting whitespace group related things just as you would in imperative code.
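To make that concrete, here is one way the same one-line pattern could be spread out and commented under /x. Treat it as a sketch: the layout and comments are my assumption about a reasonable style, and the variable name `$open_tag_rx` is invented for the example.

```perl
use strict;
use warnings;

my $open_tag_rx = qr{
    <                           # opening angle bracket
    \w+                         # tag name
    (?:                         # zero or more attributes, each one being:
        \s+                     #   mandatory leading whitespace
        \w+                     #   attribute name
        =                       #   equals sign, no whitespace around it
        (?:
            \S+                 #   a run of non-whitespace
          |
            (['"])              #   or an opening quote, remembered as group 1
            (?: (?! \1 ) . ) *? #   anything that is not that same quote
            \1                  #   the matching closing quote
        )
    ) *
    \s*                         # optional trailing whitespace
    /?                          # optional slash for self-closing tags
    >                           # closing angle bracket
}xs;

for my $text ('<img src="a.png" alt="x">', '<br />', '</p>') {
    printf "%-28s %s\n", $text, ($text =~ $open_tag_rx ? 'matches' : 'no match');
}
```

It is the same pattern as before, character for character once the whitespace and comments are stripped, but now each piece has a line to itself and a note saying what it is for.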
Even though they are declarative programs rather than imperative ones, patterns benefit even more from full problem decomposition and top-down design. One way to achieve this is with "regex subroutines" that you declare separately from where you use them. Otherwise you're just doing cut-and-paste code reuse, which is code reuse of the pessimal sort. Here is an example pattern for matching an <img> tag, this time using real Perl:
    my $img_rx = qr{

        # save capture in $+{TAG} variable
        (?<TAG> (?&image_tag) )

        # remainder is pure declaration
        (?(DEFINE)

            (?<image_tag>
                (?&start_tag)
                (?&might_white)
                (?&attributes)
                (?&might_white)
                (?&end_tag)
            )

            (?<attributes>
                (?:
                    (?&might_white)
                    (?&one_attribute)
                ) *
            )

            (?<one_attribute>
                \b
                (?&legal_attribute)
                (?&might_white) = (?&might_white)
                (?:
                    (?&quoted_value)
                  | (?&unquoted_value)
                )
            )

            (?<legal_attribute>
                (?: (?&required_attribute)
                  | (?&optional_attribute)
                  | (?&standard_attribute)
                  | (?&event_attribute)
                  # for LEGAL parse only, comment out next line
                  | (?&illegal_attribute)
                )
            )

            (?<illegal_attribute> \b \w+ \b )

            (?<required_attribute>
                alt
              | src
            )

            (?<optional_attribute>
                (?&permitted_attribute)
              | (?&deprecated_attribute)
            )

            # NB: The white space in string literals
            #     below DOES NOT COUNT!  It's just
            #     there for legibility.

            (?<permitted_attribute>
                height
              | is map
              | long desc
              | use map
              | width
            )

            (?<deprecated_attribute>
                align
              | border
              | hspace
              | vspace
            )

            (?<standard_attribute>
                class
              | dir
              | id
              | style
              | title
              | xml:lang
            )

            (?<event_attribute>
                on abort
              | on click
              | on dbl click
              | on mouse down
              | on mouse out
              | on key down
              | on key press
              | on key up
            )

            (?<unquoted_value>
                (?&unwhite_chunk)
            )

            (?<quoted_value>
                (?<quote>   ["']      )
                (?: (?! \k<quote> ) . ) *
                \k<quote>
            )

            (?<unwhite_chunk>
                (?:
                    # (?! [<>'"] )
                    (?! > )
                    \S
                ) +
            )

            (?<might_white>     \s *   )

            (?<start_tag>
                < (?&might_white)
                img
                \b
            )

            (?<end_tag>
                (?&html_end_tag)
              | (?&xhtml_end_tag)
            )

            (?<html_end_tag>       >  )
            (?<xhtml_end_tag>    / >  )

        )

    }six;
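If you want to see the mechanics without all the attribute tables, here is a cut-down sketch in the same style. It is not the pattern above: the rule names (`any_tag`, `name`, `white` and so on) are hypothetical, it accepts any tag name, and it only understands quoted attribute values. What it does show is how the top-level `(?<TAG> ...)` capture is read back through `%+` after a match.

```perl
use strict;
use warnings;

# Cut-down illustration of the "regex subroutine" style.
# Assumption: quoted attribute values only, any tag name.
my $tag_rx = qr{
    # the only part that actually matches: capture into $+{TAG}
    (?<TAG> (?&any_tag) )

    # everything below is pure declaration
    (?(DEFINE)
        (?<any_tag>     < (?&name) (?&attributes) (?&might_white) /? > )
        (?<attributes>  (?: (?&white) (?&one_attribute) ) * )
        (?<one_attribute>
            (?&name) (?&might_white) = (?&might_white) (?&quoted_value)
        )
        (?<quoted_value>
            (?<quote> ["'] )
            (?: (?! \k<quote> ) . ) *
            \k<quote>
        )
        (?<name>        \w+ )
        (?<white>       \s+ )
        (?<might_white> \s* )
    )
}sx;

my $html = q{<p>Logo: <img src="logo.png" alt='Our "logo"' /> here</p>};
while ($html =~ /$tag_rx/g) {
    print "tag: $+{TAG}\n";
}
```

Each rule is declared once inside `(?(DEFINE) ...)` and called by name with `(?&name)`, so nothing gets copied and pasted, and adding unquoted values later means touching one rule only.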
Yes, it does get long, but by getting longer it becomes more maintainable, not less. It is also more correct. Now, the real program it is used in does quite a bit more than this, because you have to consider quite a bit more in real HTML, such as CDATA and encodings and naughty redefinitions of entities. However, contrary to popular belief, you really can do this sort of thing with PHP, because it uses PCRE, which allows (?(DEFINE)...) blocks as well as recursive patterns. I have more serious examples of this sort of thing in my answers here, here, here, here, and here.
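In case "recursive patterns" sounds mysterious, here is about the smallest one I can think of, again in Perl: a named group that calls itself to match balanced parentheses. The name `$balanced_rx` and the test strings are invented for the example. PCRE supports the same `(?&name)` call syntax, so I would expect the pattern to carry over to PHP, but I have only written it with Perl in mind.

```perl
use strict;
use warnings;

my $balanced_rx = qr{
    (?<group>
        \(
        (?:
            [^()]++         # a run of non-parenthesis characters
          | (?&group)       # or a nested, balanced group: the recursion
        ) *
        \)
    )
}x;

for my $text ('f(a, g(b, c))', '(unclosed (paren') {
    if ($text =~ $balanced_rx) {
        print "'$text' contains balanced group: $+{group}\n";
    }
    else {
        print "'$text' has no balanced group\n";
    }
}
```

Nesting like this is what a pattern without recursion cannot handle at arbitrary depth.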
OK, OK, have you read all those, or at least glanced at them? Still with me? Hello?? Don't forget to breathe. There, there, you'll be all right now. :)
Of course, there is a big grey area where the possible gives way to the impractical, and much more quickly than it gives way to the impossible. If the examples in those answers, let alone the ones in this one, are beyond your current level of skill at pattern matching, then you should probably use something else, which often means getting someone else to do it for you.