Finding and replacing URLs in a block of text, but excluding URLs in link tags

Question

I am trying to run through a string and replace the URLs in it with links. This is what I have come up with so far, and it seems to work fairly well for the most part, but there are a few things I would like to polish. It may also not be the most efficient way to do this.

I have read a lot of threads about this here on SO, and although they helped a lot, I still need to tie up some loose ends.

I run through the string twice. The first pass replaces the BBCode [url] tags with HTML tags; the second pass replaces the remaining bare URLs with links:

    $body_str = preg_replace(
        '/\[url=(.+?)\](.+?)\[\/url\]/i',
        '<a href="\1" rel="nofollow" target="_blank">\2</a>',
        $body_str
    );

    $body_str = preg_replace_callback(
        '!(?:^|[^"\'])(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?!',
        function ($matches) {
            return strpos(trim($matches[0]), 'thisone.com') == FALSE
                ? '<a href="' . ltrim($matches[0], " \t\n\r\0\x0B.,@?^=%&amp;:/~\+#'") . '" rel="nofollow" target="_blank">' . ltrim($matches[0], "\t\n\r\0\x0B.,@?^=%&amp;:/~\+#'") . '</a>'
                : '<a href="' . ltrim($matches[0], " \t\n\r\0\x0B.,@?^=%&amp;:/~\+#'") . '">' . ltrim($matches[0], "\t\n\r\0\x0B.,@?^=%&amp;:/~\+#'") . '</a>';
        },
        $body_str
    );

One problem I have found so far is that the pattern tends to pick up the character immediately before "http" (a space, comma, colon, etc.), which broke the links. So I used preg_replace_callback to get around this and trim the unwanted characters that would break the link.

Another problem is avoiding broken links caused by matching URLs that are already inside A tags. I currently exclude URLs that are preceded by a single or double quotation mark, but I would rather use href=' or href=" as the exclusion.

Any tips and tricks would be much appreciated.

+4

php regex html-parsing preg-replace-callback

user1781508 Aug 08 '13 at 10:36

1 answer

Thibault · Answer 1 · 2013-08-09T11:51:49+0000

First, I took the liberty of reworking your code a bit to make it easier to read and modify:

    function urltrim($str) {
        # Strip whitespace and punctuation that the pattern captured before "http"
        return ltrim($str, " \t\n\r\0\x0B.,@?^=%&:/~\+#'");
    }
    function addlink($str, $nofollow = true) {
        return '<a href="' . urltrim($str) . '"' . ($nofollow ? ' rel="nofollow" target="_blank"' : '') . '>' . urltrim($str) . '</a>';
    }
    function checksite($str) {
        # Links to our own site (thisone.com) do not get rel="nofollow"
        return strpos(trim($str), 'thisone.com') === false ? addlink($str) : addlink($str, false);
    }

    $body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '<a href="\1" rel="nofollow" target="_blank">\2</a>', $body_str);

    $body_str = preg_replace_callback(
        '!(?:^|[^"\'])(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?!',
        function ($matches) {
            return checksite($matches[0]);
        },
        $body_str
    );

After that, I changed the way links are handled:

I treated a URL as a word (= all characters until you find a space, \n or \t, i.e. \s).
I changed the pattern so that it also checks for href= directly in front of the URL:
- if it is present, I do nothing, since this is already a link
- if there is no href=, I replace the URL with a link
So the urltrim function is no longer useful, since I no longer consume the character before http.
And of course I use urlencode to encode the URL and prevent HTML injection.

    function urltrim($str) {
        # No longer needed: the pattern does not consume the character before "http"
        return $str;
    }
    function addlink($str, $nofollow = true) {
        # urlencode() escapes the whole URL, so restore the "://" after the scheme
        $url = preg_replace("#(https?)%3A%2F%2F#", "$1://", urlencode(urltrim($str)));
        return '<a href="' . $url . '"' . ($nofollow ? ' rel="nofollow" target="_blank"' : '') . '>' . urltrim($str) . '</a>';
    }
    function checksite($str) {
        # Links to our own site (thisone.com) do not get rel="nofollow"
        return strpos(trim($str), 'thisone.com') === false ? addlink($str) : addlink($str, false);
    }

    $body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '<a href="\1" rel="nofollow" target="_blank">\2</a>', $body_str);

    $body_str = preg_replace_callback(
        '!(|href=)(["\']?)(https?://[^\s]+)!',
        function ($matches) {
            if ($matches[1]) {
                # If href= is present, don't do anything, return the original string
                return $matches[0];
            } else {
                # Add the previous char (" or ') and the link
                return $matches[2] . checksite($matches[3]);
            }
        },
        $body_str
    );
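
A hedged aside on the version above, not a correction of it: urlencode percent-encodes every reserved character, including /, ?, = and &, so a URL with a path or query string comes out altered. A URL used as the visible text of an existing link can also still be matched when whitespace separates it from the href attribute. The sketch below is one possible variation under stated assumptions, not the method described in this answer: it matches whole <a>...</a> elements first and returns them unchanged, and it swaps urlencode for htmlspecialchars. The function name linkify_outside_anchors, the $ownHost parameter and the trailing-punctuation rule are hypothetical choices made for illustration, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original posts): turn bare URLs into links
// while leaving existing <a>...</a> elements untouched.
function linkify_outside_anchors(string $html, string $ownHost = 'thisone.com'): string
{
    // Alternative 1: a complete anchor element, returned unchanged.
    // Alternative 2: a bare URL, up to whitespace, a quote or an angle bracket.
    $pattern = '~(<a\b[^>]*>.*?</a>)|(https?://[^\s<>"\']+)~is';

    return preg_replace_callback($pattern, function (array $m) use ($ownHost) {
        if ($m[1] !== '') {
            return $m[1]; // already a link, keep as-is
        }

        $url = $m[2];

        // Assumption: trailing sentence punctuation is not part of the URL.
        $trail = '';
        if (preg_match('~[.,;:!?)]+$~', $url, $t)) {
            $trail = $t[0];
            $url = substr($url, 0, -strlen($trail));
        }

        // Escape for HTML output instead of percent-encoding the URL.
        $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

        // Same substring test as the answer: own-site links get no extra attributes.
        $external = strpos($url, $ownHost) === false;
        $attrs = $external ? ' rel="nofollow" target="_blank"' : '';

        return '<a href="' . $safe . '"' . $attrs . '>' . $safe . '</a>' . $trail;
    }, $html);
}

echo linkify_outside_anchors('See http://example.com/a?b=1&c=2, ok'), "\n";
// See <a href="http://example.com/a?b=1&amp;c=2" rel="nofollow" target="_blank">http://example.com/a?b=1&amp;c=2</a>, ok
```

Because the anchor alternative comes first in the pattern and the regex engine always takes the leftmost match, an existing link is consumed as a whole before the URL inside it can be seen, so no urltrim step or quote check is needed. The sketch assumes $html is not already HTML-escaped; escaped input would have its & characters encoded twice.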

I hope this helps with your project. Let us know if it did.

Bye
