file_get_contents() - Fix relative URLs

Question

I am trying to display a website to a user by loading it with PHP. This is the script I use:

    <?php
    $url = 'http://stackoverflow.com/pagecalledjohn.php';

    // Download page
    $site = file_get_contents($url);

    // Fix relative URLs
    $site = str_replace('src="', 'src="' . $url, $site);
    $site = str_replace('url(', 'url(' . $url, $site);

    // Display to user
    echo $site;
    ?>

This script works, except for a few serious problems with the str_replace calls. The problem is relative URLs. Suppose our made-up pagecalledjohn.php contains an image of a cat. It is a PNG, and as I see it, it can be placed on the page using 6 different URLs:

    1. src="//www.stackoverflow.com/cat.png"
    2. src="http://www.stackoverflow.com/cat.png"
    3. src="https://www.stackoverflow.com/cat.png"
    4. src="somedirectory/cat.png"
    5. src="/cat.png"
    6. src="cat.png"

4 is not applicable in this case, but is included anyway!

Is there a way, using PHP, to search for src=" and replace it with the URL of the loaded page (with the file name removed), but without prepending the URL for options 1, 2 or 3, and with a slightly different procedure for options 4, 5 and 6?

+3

php regex relative-path relative-url

JBithell Apr 7 '15 at 15:54

3 answers

I am not sure I understood your question correctly. If you want to deal with all text sequences enclosed between src=" and ", the following pattern could do it:

 ~(\ssrc=")([^"]+)(")~

It has three capture groups, the second of which contains the data you are interested in. The first and the third are useful for rebuilding the whole match.
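To make the three groups concrete, here is a minimal sketch that only inspects what each group captures. The `$sample` string is a made-up input for illustration, not taken from the question.

```php
<?php
// Hypothetical sample input, used only to show what each group captures.
$sample  = '<img src="/cat.png"> <img src="cat.png">';
$pattern = '~(\ssrc=")([^"]+)(")~';

// PREG_SET_ORDER gives one array per match: [whole, group 1, group 2, group 3].
preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the prefix, $m[2] the URL, $m[3] the closing quote.
    printf("prefix=[%s] url=[%s] suffix=[%s]\n", $m[1], $m[2], $m[3]);
}
```

Only the second group needs to change; the first and third are written back unchanged around it.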

Now you can replace all instances with a callback function that changes the matched value. I created a simple string with all 6 of your cases:

    $site = <<<BUFFER
    1. src="//www.stackoverflow.com/cat.png"
    2. src="http://www.stackoverflow.com/cat.png"
    3. src="https://www.stackoverflow.com/cat.png"
    4. src="somedirectory/cat.png"
    5. src="/cat.png"
    6. src="cat.png"
    BUFFER;

Let's ignore for the moment that there are no surrounding HTML tags; I assume you are not parsing HTML anyway, since you asked for a regular expression rather than an HTML parser.

To replace each of the links, start simply by highlighting them in the string. In the following example, the match in the middle (the URL) is enclosed in markers so that it is clear what was matched:

    $pattern = '~(\ssrc=")([^"]+)(")~';

    echo preg_replace_callback($pattern, function ($matches) {
        return $matches[1] . ">>>" . $matches[2] . "<<<" . $matches[3];
    }, $site);

The output for the given example:

    1. src=">>>//www.stackoverflow.com/cat.png<<<"
    2. src=">>>http://www.stackoverflow.com/cat.png<<<"
    3. src=">>>https://www.stackoverflow.com/cat.png<<<"
    4. src=">>>somedirectory/cat.png<<<"
    5. src=">>>/cat.png<<<"
    6. src=">>>cat.png<<<"

Since the way the string is replaced is the part that needs to vary, it can be extracted so that it is easier to change:

    $callback = function ($method) {
        return function ($matches) use ($method) {
            return $matches[1] . $method($matches[2]) . $matches[3];
        };
    };

This function creates the replace callback based on a replacement method that you pass as a parameter.

Such a replacement function may be:

    $highlight = function ($string) {
        return ">>>$string<<<";
    };

And it is called as follows:

    $pattern = '~(\ssrc=")([^"]+)(")~';
    echo preg_replace_callback($pattern, $callback($highlight), $site);

The output remains the same; this was just to illustrate how the extraction works:

    1. src=">>>//www.stackoverflow.com/cat.png<<<"
    2. src=">>>http://www.stackoverflow.com/cat.png<<<"
    3. src=">>>https://www.stackoverflow.com/cat.png<<<"
    4. src=">>>somedirectory/cat.png<<<"
    5. src=">>>/cat.png<<<"
    6. src=">>>cat.png<<<"

The advantage is that the replacement function only has to deal with the matched URL as a single string, not with the regular expression's matches array and its different groups.

Now to the second half of your question: how to replace this with specific URL handling, such as removing the file name. This can be done by parsing the URL itself and removing the file name (basename) from the path component. Thanks to the extraction, you can put this into a simple function:

    $removeFilename = function ($url) {
        $url  = new Net_URL2($url);
        $base = basename($path = $url->getPath());
        $url->setPath(substr($path, 0, -strlen($base)));
        return $url;
    };

This code uses PEAR's Net_URL2 URL component (also available through Packagist and GitHub; your OS may also have a package for it). It can parse and modify URLs easily, so it is nice to have for the job.

So now the replacement is done with the new file name removal function:

    $pattern = '~(\ssrc=")([^"]+)(")~';
    echo preg_replace_callback($pattern, $callback($removeFilename), $site);

And then the result:

    1. src="//www.stackoverflow.com/"
    2. src="http://www.stackoverflow.com/"
    3. src="https://www.stackoverflow.com/"
    4. src="somedirectory/"
    5. src="/"
    6. src=""
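If installing Net_URL2 is not an option, the same idea can be approximated with built-in functions only. The following is a rough sketch, not a replacement for a real URL parser: the name `$removeFilenamePlain` is made up for illustration, and unlike the Net_URL2 version it simply discards any query string or fragment.

```php
<?php
// Sketch only: keep "scheme://host" (or "//host") plus the directory part of the path.
$removeFilenamePlain = function (string $url): string {
    // Drop any query string or fragment (an assumption of this sketch).
    $url = preg_replace('~[?#].*$~s', '', $url);

    // Split off an optional "scheme://host" or "//host" prefix from the path.
    preg_match('~^((?:[a-z][a-z0-9+.-]*:)?//[^/]*)?(.*)$~is', $url, $m);
    $prefix = $m[1];
    $path   = $m[2];

    // Keep everything up to and including the last "/" of the path.
    $pos = strrpos($path, '/');
    return $prefix . ($pos === false ? '' : substr($path, 0, $pos + 1));
};

echo $removeFilenamePlain('somedirectory/cat.png'), "\n";
```

Because it takes a string and returns a string, it can be passed to the callback factory in the same way as the other replacement functions.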

Please note that this is an example. It shows how you can do this with regular expressions. However, you can also use an HTML parser. Let this be the actual HTML snippet:

    1. <img src="//www.stackoverflow.com/cat.png"/>
    2. <img src="http://www.stackoverflow.com/cat.png"/>
    3. <img src="https://www.stackoverflow.com/cat.png"/>
    4. <img src="somedirectory/cat.png"/>
    5. <img src="/cat.png"/>
    6. <img src="cat.png"/>

Then process all the src attributes of the <img> elements using the replacement function created above:

    $doc = new DOMDocument();
    $saved = libxml_use_internal_errors(true);
    $doc->loadHTML($site, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_use_internal_errors($saved);

    // select the src attribute of every <img> element
    $srcs = (new DOMXPath($doc))->query('//img/@src') ?: [];
    foreach ($srcs as $src) {
        $src->nodeValue = $removeFilename($src->nodeValue);
    }

    echo $doc->saveHTML();

The result will again be:

    1. <img src="//www.stackoverflow.com/">
    2. <img src="http://www.stackoverflow.com/">
    3. <img src="https://www.stackoverflow.com/">
    4. <img src="somedirectory/">
    5. <img src="/">
    6. <img src="">

A different parsing method is used here, but the replacement is still the same. This is just to offer two different approaches that partly share the same code.

+2

hakre Apr 7 '15 at 21:01

I suggest doing this in a few steps.

To keep the solution simple, assume that any src value is always an image (it could also be something else, for example a script). Also assume there are no spaces between the equals sign and the quotation marks (this restriction can easily be lifted if they occur). Finally, assume that the file name does not contain any escaped quotes (if it did, the regexp would be more complex). You can then use the following regexp to find all image references: src="([^"]*)" . (This does not cover the case where src is enclosed in single quotes, but it is easy to create a similar regexp for that.)

The processing logic can then be implemented with preg_replace_callback instead of str_replace . You pass this function a callback in which each URL can be handled based on its contents.

So you can do something like this (not tested!):

    $site = preg_replace_callback(
        '~src="([^"]*)"~', // PCRE patterns need delimiters
        function ($src) {
            $url = $src[1];
            if (preg_match('~^//~', $url)) {
                // case 1: protocol-relative, keep as is
                $ret = 'src="' . $url . '"';
            } elseif (preg_match('~^https?://~', $url)) {
                // cases 2 and 3: already absolute, keep as is
                $ret = 'src="' . $url . '"';
            } else {
                // cases 4, 5 and 6: prepend the site root (avoiding a double slash)
                $ret = 'src="http://your.site.com/' . ltrim($url, '/') . '"';
            }
            return $ret;
        },
        $site
    );
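The callback above prepends a fixed site root. A slightly more general sketch resolves each value against the URL of the page that was loaded, so cases 4 and 6 end up relative to the page's directory. The function name `resolveSrc` is made up for illustration; the sketch assumes `$pageUrl` is an absolute http(s) URL and does not normalise "../" or "./" segments.

```php
<?php
// Sketch: make a src value absolute, relative to the page it came from.
function resolveSrc(string $pageUrl, string $src): string
{
    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'];
    $origin = $scheme . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';
    // Directory of the page: the path up to and including its last "/".
    $dir    = substr($path, 0, strrpos($path, '/') + 1);

    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $src)) {
        return $src;                  // cases 2 and 3: already has a scheme
    }
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;  // case 1: protocol-relative
    }
    if (strpos($src, '/') === 0) {
        return $origin . $src;        // case 5: relative to the site root
    }
    return $origin . $dir . $src;     // cases 4 and 6: relative to the page's directory
}

$pageUrl = 'http://stackoverflow.com/pagecalledjohn.php';
$html    = '<img src="/cat.png"> <img src="somedirectory/cat.png">';

$fixed = preg_replace_callback(
    '~src="([^"]*)"~',
    function (array $m) use ($pageUrl): string {
        return 'src="' . resolveSrc($pageUrl, $m[1]) . '"';
    },
    $html
);
echo $fixed, "\n";
```

Keeping the resolution logic in its own function means it can be reused unchanged if the regular expression is later swapped for a DOM-based approach.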

+1

Attilio Apr 07 '15 at 20:57

Mike Brant · Accepted Answer · 2015-04-07 21:10

Instead of trying to change every path reference in the source code, why not just add a <base> tag to the <head> to specify the base URL against which all relative URLs should be resolved?

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

This can be achieved using the DOM manipulation tool of your choice. The following example shows how to do this using DOMDocument and related classes.

    $target_domain = 'http://stackoverflow.com/';
    $url = $target_domain . 'pagecalledjohn.php';

    // Download page
    $site = file_get_contents($url);

    // loadHTML() is an instance method and returns false on failure
    $dom = new DOMDocument();
    if ($dom->loadHTML($site) === false) {
        // something went wrong in loading HTML to DOM Document
        // provide error messaging and exit
    }

    // find <head> tag
    $head_tag_list = $dom->getElementsByTagName('head');

    // there should only be one <head> tag
    if ($head_tag_list->length !== 1) {
        throw new Exception('Wow! The HTML is malformed without single head tag.');
    }
    $head_tag = $head_tag_list->item(0);

    // find first child of head tag to later use in insertion
    $head_has_children = $head_tag->hasChildNodes();
    if ($head_has_children) {
        $head_tag_first_child = $head_tag->firstChild;
    }

    // create new <base> tag
    $base_element = $dom->createElement('base');
    $base_element->setAttribute('href', $target_domain);

    // insert new base tag as first child to head tag
    if ($head_has_children) {
        $base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
    } else {
        $base_node = $head_tag->appendChild($base_element);
    }

    echo $dom->saveHTML();

If you really do want to change every path reference in the source code, I would highly recommend doing it with DOM manipulation tools (DOMDocument, DOMXPath, etc.) rather than with regex. I think you will find it a much more stable solution.
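As a rough illustration of that DOM-based route, the sketch below rewrites src attributes with DOMXPath. The `$site` string and `$origin` value are made-up stand-ins for the downloaded page and its origin, and only root-relative paths are handled here; other cases would need their own rules.

```php
<?php
// Hypothetical input standing in for the downloaded page.
$site = '<html><head><title>t</title></head><body>'
      . '<img src="/cat.png"><img src="http://www.stackoverflow.com/cat.png">'
      . '</body></html>';
$origin = 'http://stackoverflow.com'; // assumed origin of the loaded page

$dom   = new DOMDocument();
$saved = libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$dom->loadHTML($site);
libxml_clear_errors();
libxml_use_internal_errors($saved);

$xpath = new DOMXPath($dom);
// Every src attribute in the document, whatever element it belongs to.
foreach ($xpath->query('//@src') as $attr) {
    // Only root-relative paths ("/cat.png") are rewritten in this sketch;
    // protocol-relative ("//host/...") and absolute URLs are left alone.
    if (preg_match('~^/(?!/)~', $attr->nodeValue)) {
        $attr->nodeValue = $origin . $attr->nodeValue;
    }
}

$result = $dom->saveHTML();
echo $result;
```

The same loop could be pointed at href attributes by changing the XPath expression.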
