Javascript URL parsing issue

Question

I am trying to extract a URL's authority (without the protocol and www., if any) and everything after it (if any). So far, my regex is:

/^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)/;

This works on a URL that has everything, for example:

 http://www.site.com/part1/part2?key=value#blub

But if I mark the path capture group as optional:

 /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)?/

It no longer works. Why?

Now, if I revert to the first variant and match:

 http://site.com

it extracts : as the first value (authority) and //site.com as the second (path).

I did not expect this to work at all, since the URL has no path and the path group is not marked as optional. Still, I am surprised by this result, since I have only these 2 capturing groups - (.*?)(\/.*)

http://jsfiddle.net/U2tKT/1/

Can someone explain to me what is happening? Please, no links to complete URL-parsing solutions; I know there are plenty, but I want to understand what is wrong with my regex (and how to fix it).

Thanks.

+4

javascript regex

Ix Aug 30 '13 at 13:15


3 answers

At the end of your regular expression /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)?/ , the group (.*?) is lazy (because of the ? modifier): it tries to match as little as possible while still letting the whole expression succeed. Once you made the last part optional, (.*?) no longer has to match anything for the rest of the expression to succeed, because (\/.*)? is allowed to match nothing. When the last part was mandatory, (\/.*) , the group (.*?) was forced to consume enough characters for (\/.*) to find a slash and match.
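To see this concretely, here is a minimal sketch that runs both patterns from the question against the two URLs from the question. The variable names are only for illustration:

```javascript
// Both patterns are copied from the question; only the last group differs.
const required = /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)/;
const optional = /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)?/;

const full = "http://www.site.com/part1/part2?key=value#blub";
const bare = "http://site.com";

// Mandatory path: the lazy group has to grow until a "/" can follow it.
const fullRequired = full.match(required);
console.log(fullRequired[1], fullRequired[2]);

// Optional path: the lazy group stops at zero characters and the optional
// group then matches nothing, so the overall match ends right after "www.".
const fullOptional = full.match(optional);
console.log(JSON.stringify(fullOptional[0]), JSON.stringify(fullOptional[1]), fullOptional[2]);

// No path in the input: the engine backtracks, gives up the optional "://"
// group, and lets the lazy group take ":" so that "//site.com" can satisfy
// the mandatory (\/.*) group.
const bareRequired = bare.match(required);
console.log(bareRequired[1], bareRequired[2]);
```

In Node or a browser console this prints `site.com /part1/part2?key=value#blub`, then `"http://www." "" undefined`, then `: //site.com`. The last line is the result described in the question: every prefix group is optional, so the engine is free to drop `://` in order to find a slash for the mandatory group.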

+2

dg123 Aug 30 '13 at 13:21


RFC3986

The Internet Engineering Task Force (IETF) Request for Comments (RFC) document titled Uniform Resource Identifier (URI): Generic Syntax (RFC3986) is the authoritative standard that describes the exact syntax of all the components that make up a valid generic Uniform Resource Identifier (URI). Appendix B contains the regular expression:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

In this regular expression, parts of the URI are stored as follows:

scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
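As a quick illustration of that group numbering, the Appendix B expression can be used almost unchanged in JavaScript (the slashes are escaped for the regex literal). This is only a sketch; `appendixB` and `parts` are placeholder names:

```javascript
// RFC3986 Appendix B expression, with "/" escaped for a JS regex literal.
const appendixB = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

const m = appendixB.exec("http://www.site.com/part1/part2?key=value#blub");

// Pick out the groups listed above; the other groups (1, 3, 6, 8) also
// include the delimiters ":", "//", "?" and "#".
const parts = {
  scheme: m[2],
  authority: m[4],
  path: m[5],
  query: m[7],
  fragment: m[9]
};

console.log(parts);
```

For this input the object holds `http`, `www.site.com`, `/part1/part2`, `key=value` and `blub`.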

To document the above regular expression, I took the liberty of rewriting it in free-spacing mode with comments and indentation, and I present it here as a tested PHP script that parses all the main parts of a given URI string:

Solution for PHP:

    <?php // test.php Rev:20130830_0800
    $re_rfc3986_parse_generic_uri = '%
        # Parse generic URI according to RFC3986 Appendix B.
        ^             # Anchor to start of string.
        (?:           # Group for optional scheme.
          ([^:/?#]+)  # $1: Uri SCHEME.
          :           # Scheme ends with ":".
        )?            # Scheme is optional.
        (?:           # Group for optional authority.
          //          # Authority starts with "//"
          ([^/?#]*)   # $2: Uri AUTHORITY.
        )?            # Authority is optional.
        ([^?#]*)      # $3: Uri PATH (required).
        (?:           # Group for optional query.
          \?          # Query starts with "?".
          ([^#]*)     # $4: Uri QUERY.
        )?            # Query is optional.
        (?:           # Group for optional fragment.
          \#          # Fragment starts with "#".
          (.*)        # $5: Uri FRAGMENT.
        )?            # Fragment is optional.
        $             # Anchor to end of string.
        %x';
    $text = "http://www.site.com/part1/part2?key=value#blub";
    if (preg_match($re_rfc3986_parse_generic_uri, $text, $matches)) {
        print_r($matches);
    } else {
        echo("String is not a valid URI");
    }
    ?>

Two functional changes were made to the original regular expression: 1.) unnecessary capture groups were converted to non-capturing groups, and 2.) an end-of-string anchor $ was added at the end of the expression. Note that an even more readable version could be written using named capture groups instead of numbered ones, but that would not carry over directly to JavaScript syntax.

PHP script Result:

Array
(
[0] => http://www.site.com/part1/part2?key=value#blub
[1] => http
[2] => www.site.com
[3] => /part1/part2
[4] => key=value
[5] => blub
)

JavaScript solution:

Here is a tested JavaScript function that decomposes a valid URI into its various components:

    // Parse a valid URI into its various parts per RFC3986.
    function parseValidURI(text) {
        var uri_parts;
        var re_rfc3986_parse_generic_uri = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;
        // Use String.replace() with callback function to parse the URI.
        text.replace(re_rfc3986_parse_generic_uri,
            function(m0,m1,m2,m3,m4,m5) {
                uri_parts = {
                    scheme    : m1,
                    authority : m2,
                    path      : m3,
                    query     : m4,
                    fragment  : m5
                };
                return; // return value is not used.
            });
        return uri_parts;
    }

Note that the non-path properties of the returned object may be undefined if they are not present in the URI string. Also, if the URI string does not match this regular expression (i.e. it is clearly invalid), the return value is undefined.
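Named capture groups were not available in JavaScript when this answer was written, but engines that implement ES2018 accept them. Assuming such an engine, a rough sketch of the same parse might look like this. `parseUriNamed` and `URI_RE` are hypothetical names, and the pattern is the one above with group names added:

```javascript
// Same pattern as parseValidURI, with named groups (requires ES2018+).
const URI_RE = /^(?:(?<scheme>[^:\/?#]+):)?(?:\/\/(?<authority>[^\/?#]*))?(?<path>[^?#]*)(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;

// Returns a plain object with the five parts, or undefined on no match.
function parseUriNamed(text) {
  const m = URI_RE.exec(text);
  return m ? { ...m.groups } : undefined;
}

// All parts present.
console.log(parseUriNamed("http://www.site.com/part1/part2?key=value#blub"));

// No scheme, authority, query or fragment: only path is a string,
// the other properties are undefined.
console.log(parseUriNamed("site.com/x"));

// A newline in the fragment makes the pattern fail, because "." does not
// match line terminators, so the function returns undefined.
console.log(parseUriNamed("a#b\nc"));
```

The behaviour matches the note above: missing components come back as undefined properties, and a string the pattern cannot match yields undefined for the whole result.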

Notes:

The only component of a common URI that is required is a path (which itself may be empty).
An empty string is a valid URI!
The above regex does not validate the URI; it only parses a given valid URI.
If the above expression does not match a URI string, then that string is not a valid URI. However, the converse is not true: if the string matches the expression above, that does not mean the URI is valid, only that it can be parsed as a URI.
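The difference between parsing and validating can be shown with a small sketch that reuses the pattern from parseValidURI. The sample strings and variable names are made up for illustration:

```javascript
// Same pattern as in parseValidURI above.
const re = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

// An empty string parses: every optional group is skipped and the
// path (group 3) is the empty string.
const empty = re.exec("");
console.log(empty !== null, JSON.stringify(empty[3]));

// Spaces, "<" and ">" are not allowed unescaped by RFC3986, yet the
// generic pattern still splits this string into parts without complaint.
const sloppy = re.exec("http://exa mple.com/<path>?q=a b");
console.log(sloppy[1], "|", sloppy[2], "|", sloppy[3], "|", sloppy[4]);
```

Both calls succeed, which is why a match here only means "can be split into URI parts" and not "is a valid URI".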

For those who are interested in validating a URI and then breaking it apart, I wrote an article in which all the parts defined in RFC3986 Appendix A are converted to regex syntax. See:

Validating a URI

Happy regex!

+1

ridgerunner Aug 30 '13 at 17:06


Sean johnson · Accepted Answer · 2013-08-30T13:23:13+0000

user1436026 posted JUST before I was about to press the submit button, but here goes:

Your pattern for the domain (authority) is marked as "non-greedy", which means it matches as little as possible. And in your case, matching nothing at all actually satisfies the pattern; that is about as little as it gets. Instead, you should have the domain match as much as possible until you can be sure that what is matching is no longer the domain (I changed the regex to match anything except a slash, and as many of those as it finds):

 /^(?:http|https)?(?::\/\/)?(?:www\.)?([^\/]+)(\/.*)?/
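A short sketch of how that pattern behaves on the question's inputs, plus one caveat that applies to the original prefix as well: with `(?:http|https)` the engine tries `http` first, so an `https://` URL can end up with `s:` captured as the authority. The `tweaked` pattern below is a hypothetical variant, not part of this answer, and it assumes the scheme and `://` always appear together:

```javascript
// Pattern from this answer.
const fixed = /^(?:http|https)?(?::\/\/)?(?:www\.)?([^\/]+)(\/.*)?/;

const withPath = "http://www.site.com/part1/part2?key=value#blub".match(fixed);
console.log(withPath[1], withPath[2]);

// No path: the authority is still captured, the path group is undefined.
const noPath = "http://site.com".match(fixed);
console.log(noPath[1], noPath[2]);

// https input: "http" matches first, "://" cannot follow "http" here, so
// both optional groups are skipped and [^\/]+ starts at the "s".
const httpsOld = "https://site.com".match(fixed);
console.log(httpsOld[1], httpsOld[2]);

// Hypothetical variant: treat "http(s)://" as one optional unit.
const tweaked = /^(?:https?:\/\/)?(?:www\.)?([^\/]+)(\/.*)?/;
const httpsNew = "https://site.com".match(tweaked);
console.log(httpsNew[1], httpsNew[2]);
```

This prints `site.com /part1/part2?key=value#blub`, `site.com undefined`, `s: //site.com`, and `site.com undefined`.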

I know you specifically said you do not want links to any URL-parsing solutions in JS, but did you know that JS already has one built in? :)

    var link = document.createElement('a');
    link.href = "http://www.site.com/part1/part2?key=value#blub";
    auth = link.hostname; // www.site.com
    path = link.pathname; // /part1/part2
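The anchor-element trick needs a DOM. Where there is none (for example in Node), the standard `URL` constructor exposes the same properties. Note that this is a different API from the one used above, shown only as a sketch; the `rest` and `bareHost` variables are illustrative names:

```javascript
// The URL constructor is a global in modern browsers and in Node.
const url = new URL("http://www.site.com/part1/part2?key=value#blub");

const auth = url.hostname;                          // host without port
const path = url.pathname;                          // path only
const rest = url.pathname + url.search + url.hash;  // everything after the authority

// Dropping a leading "www." still has to be done by hand.
const bareHost = auth.replace(/^www\./, "");

console.log(auth, path, rest, bareHost);
```

Unlike the regexes above, `new URL()` throws a TypeError for input without a scheme such as `site.com/x`, so it fits best when the input is known to be an absolute URL.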
