RFC3986
The Internet Engineering Task Force (IETF) Request for Comments (RFC) document number 3986, titled Uniform Resource Identifier (URI): Generic Syntax (RFC3986), is the authoritative standard that describes the precise syntax of all the components that make up a valid generic Uniform Resource Identifier (URI). Appendix B presents the following regular expression:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
In this regular expression, the parts of the URI are captured in the following groups:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
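To illustrate how those group numbers line up, here is a minimal JavaScript sketch that applies the Appendix B expression unchanged. The sample URI and the variable names (`rfc3986`, `parts`) are assumptions chosen for illustration only.

```javascript
// Appendix B regex from RFC 3986, unchanged (slashes escaped for a JS literal).
const rfc3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Arbitrary sample URI, used only for illustration.
const m = rfc3986.exec("http://www.site.com/part1/part2?key=value#blub");

// Groups 2, 4, 5, 7 and 9 hold the five components.
const parts = {
  scheme: m[2],
  authority: m[4],
  path: m[5],
  query: m[7],
  fragment: m[9],
};

console.log(parts);
```

The odd-numbered groups 1, 3, 6 and 8 also capture, but they include the delimiters (`:`, `//`, `?`, `#`), which is why they are skipped here.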
To document the above regular expression, I took the liberty of rewriting it in free-spacing mode with comments and indentation, and I present it here as a tested PHP script that parses out all the major parts of a given URI string:
Solution for PHP:
<?php
Two functional changes were made to the original regular expression: 1.) the unnecessary capture groups were converted to non-capturing groups, and 2.) the end-of-string anchor $ was added at the end of the expression. Note that an even more readable version can be written using named capture groups rather than numbered capture groups, but that version would not port directly to JavaScript syntax.
PHP script output:
Array
(
[0] => http://www.site.com/part1/part2?key=value#blub
[1] => http
[2] => www.site.com
[3] => /part1/part2
[4] => key=value
[5] => blub
)
JavaScript solution:
Here is a tested JavaScript function that decomposes a valid URI into its various components:
Note that the non-path properties of the returned object may be undefined if they are not present in the URI string. In addition, if the URI string does not match this regular expression (i.e. it is definitely invalid), the return value is undefined.
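A minimal sketch of a function with the behavior just described might look like the following. The function name `parseUri` and the property names are assumptions, not necessarily those of the original listing; the regex is the Appendix B expression with the two changes described earlier (non-capturing groups and a trailing `$`).

```javascript
// Hypothetical helper (function and property names are assumptions):
// split a URI string into its five generic components.
function parseUri(uri) {
  // Unneeded captures are non-capturing; `$` anchors the end of the string.
  const re = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;
  const m = re.exec(uri);
  if (!m) return undefined; // no match: the string cannot be a URI
  return {
    scheme: m[1],    // undefined when absent
    authority: m[2], // undefined when absent
    path: m[3],      // always a string, possibly empty
    query: m[4],     // undefined when absent
    fragment: m[5],  // undefined when absent
  };
}

const sample = parseUri("http://www.site.com/part1/part2?key=value#blub");
console.log(sample);
```

With the sample URI from the PHP output above, the five properties should carry the same values as array entries 1 through 5.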
Notes:
- The only component of a generic URI that is required is the path (which may itself be empty).
- An empty string is a valid URI!
- The above regex does not validate a URI; it only parses a given valid URI.
- If the above expression does not match a URI string, then that string is not a valid URI. However, the converse is not true - if the string matches the expression above, this does not mean that the URI is valid, only that it can be parsed as a URI.
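The notes above can be checked with a short JavaScript sketch. The variable names and the sample strings are assumptions for illustration; the regex is the Appendix B expression with non-capturing groups and a trailing `$`.

```javascript
// Modified Appendix B regex (non-capturing groups, end-of-string anchor).
const uriRe = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

// An empty string matches: every component is optional and the path may be empty.
const empty = uriRe.exec("");
console.log(empty !== null, JSON.stringify(empty[3]));

// Spaces, "<" and ">" are not allowed unescaped by RFC 3986, yet this string
// still matches, so a match alone says nothing about validity.
const bogus = uriRe.exec("http://exa mple.com/<not valid>");
console.log(bogus !== null, bogus[2]);
```

In both cases the expression matches: the first shows that an empty path is enough, and the second shows a string being split into parts even though it is not a valid URI.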
For those who are interested in validating a URI and then breaking it down further, I wrote an article in which all the parts defined in RFC3986 Appendix A are converted to regex syntax.
Happy regexing!