Replace the text between the opening and closing anchor tags with other text

I need to replace the text between the two tags of a link with other text. To be more clear:

<a blah blah>Click Here</a> 

I want to replace the "Click Here" text with <img src=... />. I read several other resources and tried very hard with Lars Olav Torvik's regular expression tool, but I failed!

Please help me!

+4
4 answers

Do not use regex for parsing HTML!

Yes, in general, using regular expressions to parse HTML is fraught with danger. Computer scientists will correctly point out that HTML is not a REGULAR language. However, contrary to what many here believe, there are cases where a regex solution is perfectly valid and appropriate. Read Jeff Atwood's blog post on this subject: Parsing Html The Cthulhu Way. With that disclaimer out of the way, let's move forward with a regex solution...

Restating the problem:

The original question is rather vague. Here is a more precise (though perhaps not at all what the OP is asking) interpretation / reformulation of the question:

Given: We have some HTML text (either HTML 4.01 or XHTML 1.0). This text contains <A..>...</A> anchor elements. Some of these anchor elements are links to an image file resource (i.e. the HREF attribute points to a URI that ends with one of the file extensions JPEG, JPG, PNG or GIF). Some of these image links are simple text links, where the content of the anchor element is plain text with no other HTML elements, for example: <a href="picture.jpg">Link text with no HTML tags</a>.

Find: Is there a regular expression that will take these "plain-text-link-to-image-resource-file" links and replace the link text with an IMG element whose SRC is set to the same image URI? The following example (valid HTML 4.01) contains three paragraphs. All links in the first paragraph should be modified, but all links in the second and third paragraphs should NOT be modified and must be left as is:

HTML input example:

    <p title="Image links with plain text contents to be modified">
        This is a <a href="img1.png">LINK 1</a> simple anchor link to image.
        This <a title="<>" href="img2.jpg">LINK 2</a> has attributes before HREF.
        This <a href="img3.gif" title='<>'>LINK 3</a> has attributes after HREF.
    </p>
    <p title="NON-image links with plain text contents NOT to be modified">
        This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
        This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
        This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
    </p>
    <p title="Image links with NON-plain text contents NOT to be modified">
        This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
        This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
    </p>

Required HTML output:

    <p title="Image links with plain text contents to be modified">
        This is a <a href="img1.png"><img src="img1.png" /></a> simple anchor link to image.
        This <a title="<>" href="img2.jpg"><img src="img2.jpg" /></a> has attributes before HREF.
        This <a href="img3.gif" title='<>'><img src="img3.gif" /></a> has attributes after HREF.
    </p>
    <p title="NON-image links with plain text contents NOT to be modified">
        This is a <a href="tmp1.txt">LINK 1</a> simple anchor link to NON-image.
        This <a title="<>" href="tmp2.txt">LINK 2</a> has attributes before HREF.
        This <a href="tmp3.txt" title='<>'>LINK 3</a> has attributes after HREF.
    </p>
    <p title="Image links with NON-plain text contents NOT to be modified">
        This is a <a href="img1.png"><b>BOLD 1</b></a> anchor link to image.
        This is an <a href="img3.gif"><img src="img3.gif"/></a> image link to image.
    </p>

Please note that these examples include test-case <A..>...</A> anchor tags with both single- and double-quoted attribute values, both before and after the desired HREF attribute, and that these values contain Cthulhu-tempting (but perfectly valid HTML 4.01) angle brackets.

Note also that the replacement text is an (empty) IMG tag ending in '/>' (which is NOT valid HTML 4.01).

Regular expression solution:

The problem statement defines a highly specific pattern to be matched, with the following requirements:

  • The <A..> start tag may have any number of attributes before and/or after the HREF attribute.
  • The HREF attribute must have a value ending in JPEG, JPG, PNG or GIF (case insensitive).
  • The content of the <A..>...</A> element may not contain any other HTML tags.
  • The target <A..>...</A> element pattern is NOT a nested structure.

When working with such highly specific substrings, a well-crafted regular expression can work very well (with very few edge cases that can trip it up). Here is a tested PHP function that does a pretty good job (and correctly converts the example input above):

    // Convert text-only contents of image links to IMG element.
    function textLinksToIMG($text) {
        $re = '% # Match A element with image URL and text-only contents.
            (                     # Begin $1: A element start tag.
              <a                  # Start of A element start tag.
              (?:                 # Zero or more attributes before HREF.
                \s+               # Whitespace required before attribute.
                (?!href\b)        # Match attributes other than HREF.
                [\w\-.:]+         # Attribute name (Non-HREF).
                (?:               # Attribute value is optional.
                  \s*=\s*         # Attrib name and value separated by =.
                  (?:             # Group for attrib value alternatives.
                    "[^"]*"       # Either double quoted,
                  | \'[^\']*\'    # or single quoted,
                  | [\w\-.:]+     # or unquoted value.
                  )               # End group of value alternatives.
                )?                # Attribute value is optional.
              )*                  # Zero or more attributes before HREF.
              \s+                 # Whitespace required before attribute.
              href\s*=\s*         # HREF attribute name.
              (?|                 # Branch reset group for $2: HREF value.
                "([^"]*)"         # Either $2.1: double quoted,
              | \'([^\']*)\'      # or $2.2: single quoted,
              | ([\w\-.:]+)       # or $2.3: unquoted value.
              )                   # End group of HREF value alternatives.
              (?<=                # Look behind to assert HREF value was...
                jpeg[\'"]         # either JPEG,
              | jpg[\'"]          # or JPG,
              | png[\'"]          # or PNG,
              | gif[\'"]          # or GIF,
              )                   # End look behind assertion.
                                  # (A closing quote is required here, so
                                  # unquoted HREF values never pass.)
              (?:                 # Zero or more attributes after HREF.
                \s+               # Whitespace required before attribute.
                [\w\-.:]+         # Attribute name.
                (?:               # Attribute value is optional.
                  \s*=\s*         # Attrib name and value separated by =.
                  (?:             # Group for attrib value alternatives.
                    "[^"]*"       # Either double quoted,
                  | \'[^\']*\'    # or single quoted,
                  | [\w\-.:]+     # or unquoted value.
                  )               # End group of value alternatives.
                )?                # Attribute value is optional.
              )*                  # Zero or more attributes after HREF.
              \s*                 # Allow whitespace before closing >
              >                   # End of A element start tag.
            )                     # End $1: A element start tag.
            ([^<>]*)              # $3: A element contents (text-only).
            (</a\s*>)             # $4: A element end tag.
            %ix';
        return preg_replace($re, '$1<img src="$2" />$4', $text);
    }

Yes, the regular expression in this solution is long, but that is mostly due to the extensive comments, which also make it readable. It also correctly handles quoted attribute values that may contain angle brackets. And yes, it is certainly possible to craft HTML markup that will break this solution, but the markup needed to do so would be so convoluted as to be practically unheard of.
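
For readers working in JavaScript rather than PHP, the same idea can be sketched roughly as follows. This is an illustrative port, not the function above: JavaScript has neither the x flag nor branch reset groups, so the pattern is assembled from commented string pieces and the HREF value lands in one of three capture groups. The names VALUE, attrs, HREF, IMAGE_LINK and textLinksToImg are made up for this sketch. It also assumes the extension check should require a dot before jpeg, jpg, png or gif, and, unlike the PHP look-behind, it accepts unquoted HREF values.

```javascript
// Attribute value: double quoted, single quoted, or unquoted.
const VALUE = `(?:"[^"]*"|'[^']*'|[\\w\\-.:]+)`;

// Zero or more attributes; `guard` can exclude a name (e.g. HREF).
const attrs = (guard) => `(?:\\s+${guard}[\\w\\-.:]+(?:\\s*=\\s*${VALUE})?)*`;

// HREF attribute; the value ends up in exactly one of three groups.
const HREF = `\\s+href\\s*=\\s*(?:"([^"]*)"|'([^']*)'|([\\w\\-.:]+))`;

const IMAGE_LINK = new RegExp(
  `(<a${attrs("(?!href\\b)")}${HREF}${attrs("")}\\s*>)` + // $1: start tag, $2-$4: HREF value
    `[^<>]*` +                                             // text-only contents (dropped)
    `(</a\\s*>)`,                                          // $5: end tag
  "gi"
);

// Assumption: an "image link" is an HREF ending in .jpeg, .jpg, .png or .gif.
function textLinksToImg(html) {
  return html.replace(IMAGE_LINK, (whole, startTag, dq, sq, bare, endTag) => {
    const url = dq ?? sq ?? bare;
    if (!/\.(?:jpe?g|png|gif)$/i.test(url)) return whole; // leave non-image links alone
    return `${startTag}<img src="${url}" />${endTag}`;
  });
}

console.log(textLinksToImg('<a href="img1.png">LINK 1</a>'));
// <a href="img1.png"><img src="img1.png" /></a>
```

Doing the extension check in the replacer callback instead of a look-behind keeps the pattern shorter, at the cost of matching (and then returning unchanged) anchors that are not image links.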

+6

You should not use regular expressions for HTML parsing. HTML is not a regular language and therefore cannot be parsed correctly with regular expressions. No matter how many things you pile into a regular expression, it can be fooled. Consider <a href=">Hello</a>">Hello</a> for example.
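
To make that example concrete, here is a small JavaScript sketch (the variable names are illustrative only). It shows a naive pattern stopping in the middle of the quoted HREF value, and a quote-aware start-tag pattern coping with this particular input. The second pattern is still only a pattern match, not a parser: anchors inside comments or script content, for instance, would be matched as well.

```javascript
// Input from the paragraph above: the quoted HREF value contains ">" and "</a>".
const tricky = '<a href=">Hello</a>">Hello</a>';

// Naive pattern: assumes the first ">" ends the start tag.
const naive = /<a[^>]*>(.*?)<\/a>/i;
console.log(naive.exec(tricky)[0]);
// <a href=">Hello</a>     (the match ends inside the attribute value)

// Quote-aware start tag: attribute values may be "...", '...' or unquoted.
const quoteAware =
  /<a(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*>([^<>]*)<\/a\s*>/i;
console.log(quoteAware.exec(tricky)[0] === tricky);
// true
```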

No matter what language you work in, there is almost certainly an HTML parsing library available for this that does it right.

Obligatory.

+4

If you are familiar with jQuery, this can be done quite easily as follows:

Here is the HTML code for an example script:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Untitled Document</title>
    <script src="http://code.jquery.com/jquery-latest.js"></script>
    <script>
    $(function(){
        $("#testAnchor").html(" This is replaced image! <img src='http://www.google.com/logos/2011/newyearseve-2011-hp.jpg' />");
    });
    </script>
    </head>
    <body>
    <a href="#" id="testAnchor"> Click Here! </a>
    </body>
    </html>

Please note that "Click Here!" is replaced by the image and text at runtime. You can comment out the following line to see the page without "Click Here!" being replaced:

 // $("#testAnchor").html(" This is replaced image! <img src='http://www.google.com/logos/2011/newyearseve-2011-hp.jpg' />"); 
+2

Well, if you really want to use regular expressions, here is a pattern: <a[^>]*>(.*?)</a>

JavaScript code:

    var myRegexp = /<a[^>]*>(.*?)<\/a>/i,
        subjectString = '<a blah blah>Click Here</a>',
        match = myRegexp.exec(subjectString),
        result;
    if (match != null && match.length > 1) {
        result = match[1]; // text between the anchor tags
    } else {
        result = "";
    }

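C# code:

    // Requires: using System.Text.RegularExpressions;
    // SubjectString is assumed to hold the HTML to search.
    string ResultString = "";
    Regex RegexObj = new Regex("<a[^>]*>(.*?)</a>", RegexOptions.IgnoreCase);
    ResultString = RegexObj.Match(SubjectString).Groups[1].Value;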

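PHP code:

    // On success, $matches[1] holds the text between the anchor tags.
    if (preg_match('/<a[^>]*>(.*?)<\/a>/', '<a blah blah>Click Here</a>', $matches)) {
        $result = $matches[1];
    } else {
        $result = "";
    }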
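
The snippets in this answer only extract the text between the tags, whereas the question asks for it to be replaced. A minimal JavaScript sketch of the replacement step, using a similar simple pattern, might look like this. The function name replaceAnchorText and the file name picture.jpg are placeholders, and the pattern shares the weaknesses of the simple pattern (for example, a ">" inside a quoted attribute value will confuse it).

```javascript
// Hypothetical helper: swap the text of every text-only <a> element for an <img>.
function replaceAnchorText(html, imgSrc) {
  const img = `<img src="${imgSrc}" />`;
  // Group 1 = start tag, group 2 = end tag; [^<]* limits the contents to plain text.
  // A replacer function is used so "$" characters in imgSrc are taken literally.
  return html.replace(/(<a\b[^>]*>)[^<]*(<\/a>)/gi, (m, open, close) => open + img + close);
}

console.log(replaceAnchorText('<a blah blah>Click Here</a>', 'picture.jpg'));
// <a blah blah><img src="picture.jpg" /></a>
```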
+1

Source: https://habr.com/ru/post/1388723/

