What happens if <base href ...> is set with a double slash?
I would like to understand how to handle the <base href="" /> value in my web crawler, so I tested several combinations in the major browsers and finally found something with double slashes that I don't understand.
If you don't want to read everything, jump to the test results D and E. Demonstration of all tests:
http://gutt.it/basehref.php
Step by step, my test results when calling http://example.com/images.html :
A - Multiple base href
    <html>
    <head>
    <base target="_blank" />
    <base href="http://example.com/images/" />
    <base href="http://example.com/" />
    </head>
    <body>
    <img src="/images/image.jpg">
    <img src="image.jpg">
    <img src="./image.jpg">
    <img src="images/image.jpg"> not found
    <img src="/image.jpg"> not found
    <img src="../image.jpg"> not found
    </body>
    </html>

Conclusion
- only the first <base> with href counts
- a source starting with / targets the root
- ../ goes one folder up
B - Without trailing slash
    <html>
    <head>
    <base href="http://example.com/images" />
    </head>
    <body>
    <img src="/images/image.jpg">
    <img src="image.jpg"> not found
    <img src="./image.jpg"> not found
    <img src="images/image.jpg">
    <img src="/image.jpg"> not found
    <img src="../image.jpg"> not found
    </body>
    </html>

Conclusion
- <base href> ignores everything after the last slash, so http://example.com/images becomes http://example.com/
C - How It Should Be
    <html>
    <head>
    <base href="http://example.com/" />
    </head>
    <body>
    <img src="/images/image.jpg">
    <img src="image.jpg"> not found
    <img src="./image.jpg"> not found
    <img src="images/image.jpg">
    <img src="/image.jpg"> not found
    <img src="../image.jpg"> not found
    </body>
    </html>

Conclusion
- Same result as Test B as expected
D - Double Slash
    <html>
    <head>
    <base href="http://example.com/images//" />
    </head>
    <body>
    <img src="/images/image.jpg">
    <img src="image.jpg">
    <img src="./image.jpg">
    <img src="images/image.jpg"> not found
    <img src="/image.jpg"> not found
    <img src="../image.jpg">
    </body>
    </html>

E - Double slash with whitespace
    <html>
    <head>
    <base href="http://example.com/images/ /" />
    </head>
    <body>
    <img src="/images/image.jpg">
    <img src="image.jpg"> not found
    <img src="./image.jpg"> not found
    <img src="images/image.jpg"> not found
    <img src="/image.jpg"> not found
    <img src="../image.jpg">
    </body>
    </html>

Neither is a "valid" URL, but both are real results from my web crawler. Please explain what happens in D and E such that ../image.jpg is found, and why the whitespace makes a difference.
Just for your interest:
- <base href="http://example.com//" /> gives the same results as Test C
- <base href="http://example.com/ /" /> is completely different: only ../image.jpg is found
- <base href="a/" /> only finds /images/image.jpg
The base behavior is explained in the HTML specification:
The base element allows authors to specify a document base URL for resolving relative URLs.
As shown in your test A, if there are multiple base elements with href, the document base URL is taken from the first one.
Relative URL resolution is as follows:
Apply the URL parser to url, with base as the base URL, with encoding as the encoding.
The URL parsing algorithm is defined in the URL specification.
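In JavaScript, the `URL` constructor runs that parser, so the resolution can be tried without a DOM. The following is only a sketch for a crawler: the helper name `resolveWithBase` and its parameters are made up for illustration, and it assumes the `href` of the first `<base>` has already been extracted from the page (pass `null` when there is none).

```javascript
// Hypothetical helper: resolve `relative` the way a page at `documentUrl`
// containing <base href="baseHref"> would resolve it.
function resolveWithBase(documentUrl, baseHref, relative) {
  // The base href is itself resolved against the document URL,
  // so a relative value such as "a/" still yields an absolute base.
  const base = baseHref === null
    ? new URL(documentUrl)
    : new URL(baseHref, documentUrl);
  return new URL(relative, base).href;
}

const page = "http://example.com/images.html";
console.log(resolveWithBase(page, "http://example.com/images/", "image.jpg"));
// http://example.com/images/image.jpg
console.log(resolveWithBase(page, "a/", "image.jpg"));
// http://example.com/a/image.jpg
console.log(resolveWithBase(page, null, "images/image.jpg"));
// http://example.com/images/image.jpg
```

Note that `new URL` throws a `TypeError` for input it cannot parse, so a real crawler would probably want to wrap these calls in a `try`/`catch`.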
It is too complex to explain here in detail. But basically, this is what happens:
- A relative URL starting with / is resolved relative to the host of the base URL.
- Otherwise, the relative URL is resolved relative to the last directory of the base URL.
- Keep in mind that if the base path does not end with /, the last part is treated as a file, not a directory.
- ./ is the current directory
- ../ goes one directory up
(Probably "directory" and "file" are not the correct terminology for URLs.)
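The rules above can be checked with a few lines of JavaScript. This is a minimal sketch that assumes the standard `URL` constructor (available in browsers and Node.js); the names `dirBase`, `fileBase` and `resolve` are only illustrative.

```javascript
const dirBase = "http://example.com/images/"; // ends with "/": "images/" is the base directory
const fileBase = "http://example.com/images"; // no trailing "/": "images" is treated as a file

const resolve = (rel, base) => new URL(rel, base).href;

console.log(resolve("/image.jpg", dirBase));   // http://example.com/image.jpg
console.log(resolve("image.jpg", dirBase));    // http://example.com/images/image.jpg
console.log(resolve("./image.jpg", dirBase));  // http://example.com/images/image.jpg
console.log(resolve("../image.jpg", dirBase)); // http://example.com/image.jpg
console.log(resolve("image.jpg", fileBase));   // http://example.com/image.jpg
```

The last line corresponds to test B: without the trailing slash, the base directory is the root.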
Some examples:
- http://example.com/images/a/./ is http://example.com/images/a/
- http://example.com/images/a/../ is http://example.com/images/
- http://example.com/images//./ is http://example.com/images//
- http://example.com/images//../ is http://example.com/images/
- http://example.com/images/./ is http://example.com/images/
- http://example.com/images/../ is http://example.com/
Note that in most cases, // will behave like /. As @poncha said,
Unless you use some kind of URL rewriting (in which case the number of slashes can affect the rewrite rules), the URI maps to a path on disk, and in (most?) modern operating systems (Linux/Unix, Windows), multiple path separators in a row have no special meaning, so /path/to/foo and /path//to////foo would ultimately map to the same file.
However, in general, / / will not behave like //: the space between the slashes is a path segment of its own.
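Tests D and E can be reproduced the same way. The sketch below again assumes the standard `URL` constructor; `resolveAll` is a hypothetical helper that resolves two of the tested sources against a given base.

```javascript
// Resolve the two interesting sources from the tests against `base`.
const resolveAll = (base) =>
  ["image.jpg", "../image.jpg"].map((rel) => new URL(rel, base).href);

// Test D: the empty segment between the two slashes acts as a directory.
console.log(resolveAll("http://example.com/images//"));
// [ 'http://example.com/images//image.jpg', 'http://example.com/images/image.jpg' ]

// Test E: the space is a directory too, and is percent-encoded as %20.
console.log(resolveAll("http://example.com/images/ /"));
// [ 'http://example.com/images/%20/image.jpg', 'http://example.com/images/image.jpg' ]
```

In both cases ../ climbs out of the extra (empty or whitespace) directory and lands in /images/, which is why ../image.jpg is found. In D, image.jpg is requested as /images//image.jpg, which most servers map to the same file; in E it is requested from a directory named " ", which does not exist.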
You can use the following snippet to resolve a list of relative URLs to absolute ones:
    var bases = [
      "http://example.com/images/",
      "http://example.com/images",
      "http://example.com/",
      "http://example.com/images//",
      "http://example.com/images/ /"
    ];
    var urls = [
      "/images/image.jpg",
      "image.jpg",
      "./image.jpg",
      "images/image.jpg",
      "/image.jpg",
      "../image.jpg"
    ];
    function newEl(type, contents) {
      var el = document.createElement(type);
      if(!contents) return el;
      if(!(contents instanceof Array))
        contents = [contents];
      for(var i=0; i<contents.length; ++i)
        if(typeof contents[i] == 'string')
          el.appendChild(document.createTextNode(contents[i]))
        else if(typeof contents[i] == 'object') // contents[i] instanceof Node
          el.appendChild(contents[i])
      return el;
    }
    function emoticon(str) {
      return {
        'http://example.com/images/image.jpg': 'good',
        'http://example.com/images//image.jpg': 'neutral'
      }[str] || 'bad';
    }
    var base = document.createElement('base'),
        a = document.createElement('a'),
        output = document.createElement('ul'),
        head = document.getElementsByTagName('head')[0];
    head.insertBefore(base, head.firstChild);
    for(var i=0; i<bases.length; ++i) {
      base.href = bases[i];
      var test = newEl('li', [
        'Test ' + (i+1) + ': ',
        newEl('span', bases[i])
      ]);
      test.className = 'test';
      var testItems = newEl('ul');
      testItems.className = 'test-items';
      for(var j=0; j<urls.length; ++j) {
        a.href = urls[j];
        var absURL = a.cloneNode(false).href; /* Stupid old IE requires cloning https://stackoverflow.com/a/24437713/1529630 */
        var testItem = newEl('li', [
          newEl('span', urls[j]),
          ' → ',
          newEl('span', absURL)
        ]);
        testItem.className = 'test-item ' + emoticon(absURL);
        testItems.appendChild(testItem);
      }
      test.appendChild(testItems);
      output.appendChild(test);
    }
    document.body.appendChild(output);

    span {
      background: #eef;
    }
    .test-items {
      display: table;
      border-spacing: .13em;
      padding-left: 1.1em;
      margin-bottom: .3em;
    }
    .test-item {
      display: table-row;
      position: relative;
      list-style: none;
    }
    .test-item > span {
      display: table-cell;
    }
    .test-item:before {
      display: inline-block;
      width: 1.1em;
      height: 1.1em;
      line-height: 1em;
      text-align: center;
      border-radius: 50%;
      margin-right: .4em;
      position: absolute;
      left: -1.1em;
      top: 0;
    }
    .good:before {
      content: ':)';
      background: #0f0;
    }
    .neutral:before {
      content: ':|';
      background: #ff0;
    }
    .bad:before {
      content: ':(';
      background: #f00;
    }

You can also play with this snippet:
    var resolveURL = (function() {
      var base = document.createElement('base'),
          a = document.createElement('a'),
          head = document.getElementsByTagName('head')[0];
      return function(url, baseurl) {
        if(base) {
          base.href = baseurl;
          head.insertBefore(base, head.firstChild);
        }
        a.href = url;
        var abs = a.cloneNode(false).href; /* Stupid old IE requires cloning https://stackoverflow.com/a/24437713/1529630 */
        if(base)
          head.removeChild(base);
        return abs;
      };
    })();
    var base = document.getElementById('base'),
        url = document.getElementById('url'),
        abs = document.getElementById('absolute');
    base.onpropertychange = url.onpropertychange = function() {
      if (event.propertyName == "value")
        update()
    };
    (base.oninput = url.oninput = update)();
    function update() {
      abs.value = resolveURL(url.value, base.value);
    }

    label {
      display: block;
      margin: 1em 0;
    }
    input {
      width: 100%;
    }

    <label>
      Base url:
      <input id="base" value="http://example.com/images//foo////bar/baz"
             placeholder="Enter your base url here" />
    </label>
    <label>
      URL to be resolved:
      <input id="url" value="./a/b/../c"
             placeholder="Enter your URL here">
    </label>
    <label>
      Resulting url:
      <input id="absolute" readonly>
    </label>