preg_replace all links in file_get_contents output that do not contain a given word

I am reading a page into a variable, and I would like to disable all links that do not contain the word "remedy" in the address. The code I have still captures all the links, including those that contain "remedy". What am I doing wrong?

$page = preg_replace('~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i', '<font color="#808080">$1</font>', $page); 

- Solution -

 $page = preg_replace('~<a href="((?!remedy).)*?".*?>(.*?)</a>~i', '<font color="#808080">$2</font>', $page); 
2 answers

Try ~<a href="((?!remedy).)*?".*?>(.*?)</a>~i

As for what you are doing wrong: a regex matches whenever it possibly can, and '~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i' can match every URL, even one that contains remedy. Your pattern does not say that remedy must not appear anywhere in the attribute. It only says that somewhere in the attribute there is one position that is not immediately followed by remedy, with anything ( .*? ) before and after it. Such a position always exists, for example right before the closing quote, so the lookahead never rejects a link. The lookahead has to be checked at every character of the address instead.
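A quick way to see the difference is to run both kinds of pattern against a few links. This is only a sketch: the sample links and variable names are made up for illustration, and the second pattern places the lookahead before the dot so that an address beginning with remedy is rejected as well.

```php
<?php
// Sample links, made up for illustration.
$links = [
    '<a href="http://example.com/remedy/1">with remedy</a>',
    '<a href="remedy.html">starts with remedy</a>',
    '<a href="http://example.com/other">without remedy</a>',
];

// Pattern from the question: the lookahead only has to succeed at one position.
$single = '~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i';
// Tempered pattern: the lookahead is checked before every consumed character.
$every = '~<a href="((?!remedy).)*?".*?>(.*?)</a>~i';

$singleHits = [];
$everyHits = [];
foreach ($links as $link) {
    // preg_match() returns 1 when the link would be disabled, 0 otherwise.
    $singleHits[] = preg_match($single, $link);
    $everyHits[] = preg_match($every, $link);
}

echo implode(',', $singleHits), "\n"; // 1,1,1  (every link is caught)
echo implode(',', $everyHits), "\n";  // 0,0,1  (only the link without remedy)
```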


I would probably use this:

 <a href="(?:(?!remedy)[^"])*"[^>]*>([^<]*)</a> 

The most interesting part:

 "(?:(?!remedy)[^"])*" 

Each time [^"] is about to consume another character, the lookahead runs first and confirms that this character is not the start of the word remedy . Using [^"] instead of . keeps it from looking at anything beyond the closing quote. I also took the liberty of replacing your .*? with negated character classes. They serve the same purpose: each part of the match stays corralled in the area where you want it to match. It is also more efficient and more robust.
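Dropped into preg_replace, this pattern might be used as in the sketch below. The ~ delimiters and the i flag are carried over from the question as an assumption, and the sample markup is invented for illustration.

```php
<?php
// Sample markup, made up for illustration.
$page = '<a href="/remedy/list">keep</a> <a href="/about" class="nav">grey out</a>';

// Lookahead before each [^"] keeps "remedy" out of the address;
// [^>]* and [^<]* keep the rest of the match inside one tag and one link text.
$pattern = '~<a href="(?:(?!remedy)[^"])*"[^>]*>([^<]*)</a>~i';

$result = preg_replace($pattern, '<font color="#808080">$1</font>', $page);

echo $result, "\n";
// <a href="/remedy/list">keep</a> <font color="#808080">grey out</font>
```

Because the address group is non-capturing here, the link text is in $1 rather than $2.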

Of course, I assume that the content of the <a> element is plain text, which does not contain more elements inside it. In fact, this is just one of many simplifying assumptions I have made. You cannot match HTML with regular expressions without them.
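To make those assumptions concrete, here is a small sketch with a few links the pattern silently skips even though their addresses do not contain remedy. The sample links are hypothetical, and the delimiters and i flag are again borrowed from the question.

```php
<?php
$pattern = '~<a href="(?:(?!remedy)[^"])*"[^>]*>([^<]*)</a>~i';

// Sample links, made up for illustration.
$samples = [
    'plain'        => '<a href="/other">plain text</a>',
    'nested'       => '<a href="/other"><b>bold</b> text</a>',   // element inside the link
    'single_quote' => "<a href='/other'>plain text</a>",          // single-quoted attribute
    'attr_order'   => '<a class="nav" href="/other">plain</a>',   // href is not the first attribute
];

$matched = [];
foreach ($samples as $name => $html) {
    // Record whether the pattern finds the link at all.
    $matched[$name] = preg_match($pattern, $html) === 1;
}

var_export($matched);
// Only 'plain' is true; the other three variants are left untouched.
```

If the pages you read use any of these variants, the pattern needs to be loosened accordingly or the links will stay active.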


Source: https://habr.com/ru/post/1480376/

