Do not use regex for parsing HTML!
Yes, in general, using regular expressions to parse HTML is fraught with peril. Computer scientists will correctly point out that HTML is not a regular language. However, contrary to what many here believe, there are cases where a regex solution is perfectly valid and appropriate. Read Jeff Atwood's blog post on the subject: "Parsing Html The Cthulhu Way". With that disclaimer out of the way, let's forge ahead with a regex solution...
Restating the problem:
The original question is rather vague. Here is a more precise interpretation/reformulation of the question (though perhaps not at all what the OP is asking):
Given: We have some HTML text (HTML 4.01 or XHTML 1.0). This text contains <A..>...</A> anchor elements. Some of these anchor elements are links to an image file resource (i.e. the HREF attribute points to a URI that ends with one of the file extensions JPEG, JPG, PNG or GIF). Some of these image links are simple text links, where the content of the anchor element is plain text with no other HTML elements, for example: <a href="picture.jpg">Link text with no HTML tags</a>.
Find: Is there a regular expression that will take these "plain-text-link-to-image-resource-file" links and replace the link text with an IMG element whose SRC is set to the same image URI? The following example (valid HTML 4.01) contains three paragraphs. All links in the first paragraph should be modified, but all links in the second and third paragraphs should NOT be modified and should be left as-is:
HTML input example:
<p title="Image links with plain text contents to be modified">
    This is a <a href="img1.png">LINK 1</a> simple anchor link to image.
    This <a title="<>" href="img2.jpg">LINK 2</a> has attributes before HREF.
    This <a href="img3.gif" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="NON-image links with plain text contents NOT to be modified">
    This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
    This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
    This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="Image links with NON-plain text contents NOT to be modified">
    This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
    This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
</p>
Required HTML output:
<p title="Image links with plain text contents to be modified">
    This is a <a href="img1.png"><img src="img1.png" /></a> simple anchor link to image.
    This <a title="<>" href="img2.jpg"><img src="img2.jpg" /></a> has attributes before HREF.
    This <a href="img3.gif" title='<>'><img src="img3.gif" /></a> has attributes after HREF.
</p>
<p title="NON-image links with plain text contents NOT to be modified">
    This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
    This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
    This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
</p>
<p title="Image links with NON-plain text contents NOT to be modified">
    This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
    This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
</p>
Note that these examples include test cases where the <A..>...</A> anchor tags have both single- and double-quoted attribute values, both before and after the desired HREF attribute, and where those values contain Cthulhu-tempting (yet perfectly valid HTML 4.01) angle brackets.
Note also that the replacement text is an (empty) IMG tag ending in: '/>' (which is NOT valid HTML 4.01).
Regular expression solution:
The problem statement defines a highly specific pattern to be matched, which has the following requirements:
- The <A..>...</A> start tag may have any number of attributes before and/or after the HREF attribute.
- The HREF attribute must have a value ending in JPEG, JPG, PNG or GIF (case insensitive).
- The contents of the <A..>...</A> element may not contain any other HTML tags.
- The <A..>...</A> target pattern is NOT a nested structure.
When dealing with such tightly defined sub-strings, a well-crafted regex can work quite well (with very few edge cases that can trip it up). Here is a tested PHP function that does a pretty good job (and correctly transforms the example input above):
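The function referred to above is not reproduced in this text. As a stand-in, here is a minimal sketch of how such a function might look, built only from the four requirements listed earlier. The function name `textLinksToImg`, the helper sub-patterns `$attr` and `$ext`, and the exact regex are assumptions of this sketch, not the original code.

```php
<?php
// Hypothetical helper (name and pattern are assumptions of this sketch):
// replace the text-only contents of image links with an IMG element
// whose SRC is the same URI as the link HREF.
function textLinksToImg(string $html): string
{
    // One attribute that is not HREF: a name plus an optional value that is
    // double-quoted, single-quoted or unquoted. Quoted values may hold < and >.
    $attr = '\s+(?!href[\s=])[\w\-.:]+'
          . '(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s"\'<>=]+))?';
    // Required ending of the HREF value.
    $ext = '\.(?:jpe?g|png|gif)';

    $re = '%
        (                               # $1: the complete A start tag
          <a\b
          (?:' . $attr . ')*            # attributes before HREF
          \s+href\s*=\s*
          (?|                           # branch reset: each branch fills $2
            "([^"]*' . $ext . ')"       # double-quoted value
          | \'([^\']*' . $ext . ')\'    # single-quoted value
          | ([^\s"\'<>=]*' . $ext . ')(?=[\s>])  # unquoted value
          )
          (?:' . $attr . ')*            # attributes after HREF
          \s*>
        )
        [^<>]*                          # text-only contents (no tags)
        (</a\s*>)                       # $3: the A end tag
        %ix';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($re, '$1<img src="$2" />$3', $html) ?? $html;
}

// Usage: only the first link is a plain-text link to an image.
echo textLinksToImg(
    '<a title="<>" href="img2.jpg">LINK 2</a> <a href="tmp2.txt">LINK 2</a>'
), "\n";
// <a title="<>" href="img2.jpg"><img src="img2.jpg" /></a> <a href="tmp2.txt">LINK 2</a>
```

The branch reset group `(?|...)` is a PCRE feature that lets all three quoting styles store the HREF value in the same capture group, so the replacement string can always use `$2`. Because the contents are matched with `[^<>]*`, any link whose body contains another tag fails to match and is left untouched.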
Yes, the regex in this solution is long, but that is mostly due to the extensive commenting, which also makes it readable. It also correctly handles quoted attribute values that may contain angle brackets. Yes, it is certainly possible to craft HTML markup that will break this solution, but the markup required to do so would be so convoluted as to be almost unheard of.
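To illustrate why handling quoted attribute values matters, the following sketch compares a naive start-tag pattern with a quote-aware one on a tag taken from the example input. Both patterns are illustrative assumptions and are not taken from the original function.

```php
<?php
// Start tag from the example input: the title value contains < and >.
$tag = '<a title="<>" href="img2.jpg">';

// Naive: stops at the first ">" even though it sits inside a quoted value.
preg_match('/<a\b[^>]*>/i', $tag, $naive);

// Quote-aware: quoted strings are consumed as whole units,
// so a ">" inside them cannot end the tag.
preg_match('/<a\b(?:"[^"]*"|\'[^\']*\'|[^>"\'])*>/i', $tag, $aware);

echo $naive[0], "\n";  // <a title="<>
echo $aware[0], "\n";  // <a title="<>" href="img2.jpg">
```

The naive pattern cuts the tag short before it ever reaches the HREF attribute, which is the failure that treating quoted values as single units avoids.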