Regular expression to match a generic URL

I have searched everywhere and still haven't found a regex pattern that matches a generic URL. I need to support multiple protocols (with validation), localhost and/or IP addressing, ports, and query strings. Some examples:

Ideally, I would like the pattern to also support extracting the various elements (protocol, host, port, query string, etc.), but this is not a requirement.

(Also, for me and future readers, if you could explain the pattern, it would be helpful.)

3 answers

Nicolas Carey correctly points you to RFC-3986. The regular expression it provides will match any generic URI, but it will not validate one (and it is not suitable for picking URLs out of text "in the wild": it is too loose and matches almost any string, including the empty string).
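To make the "too loose" point concrete, here is a minimal Python sketch (Python is only my choice for illustration). The pattern is the one from RFC 3986 Appendix B; the sample strings are made up.

```python
import re

# Appendix B pattern from RFC 3986.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Hypothetical inputs: none of these is a usable URL.
samples = ["", "not a url at all", "://", "???"]

# Every component in the pattern is optional, so each string still matches.
results = {s: URI_REFERENCE.match(s) is not None for s in samples}
for text, matched in results.items():
    print(repr(text), "->", matched)
```

Because the pattern can match the empty string, a successful match says nothing about validity; it is only useful for splitting a string that is already known to be a URI.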

For the validation requirement, take a look at an article I wrote on this subject. It takes the ABNF syntax definitions of all the URI components from Appendix A and provides an equivalent regular expression for each:

Validating a URI

As for picking URLs out of text "in the wild", see Jeff Atwood's "The Problem With URLs" and John Gruber's "An Improved Liberal, Accurate Regex Pattern for Matching URLs" to get an idea of some of the subtle issues that can arise. Alternatively, you can take a look at a project I started last year: URL Linkification. It linkifies HTTP and FTP URLs that are not yet linked, in text that may already contain some links.

That said, the following PHP function uses a slightly modified version of the RFC-3986 "Absolute URI" regular expression to validate HTTP and FTP URLs (with this regular expression, the host part must not be empty). All of the various components of the URI are isolated and captured in named groups, making it easy to manipulate and verify the parts in program code:

    function url_valid($url) {
        if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
        if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
        if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
            ^
            (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
            (?P<authority>
              (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
              (?P<host>
                (?P<IP_literal>
                  \[
                  (?:
                    (?P<IPV6address>
                      (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                      |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                      | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                      | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                      | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                      | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                      | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                      )
                      (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                      | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                        (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                      )
                    | (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                    | (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                    )
                  | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
                  )
                  \]
                )
              | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                                (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
              | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
              )
              (?::(?P<port>[0-9]*))?
            )
            (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
            (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
            (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
            $
            /mx', $url, $m)) return FALSE;
        switch ($m['scheme']) {
        case 'https':
        case 'http':
            if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
            break;
        case 'ftps':
        case 'ftp':
            break;
        default:
            return FALSE; // Unrecognised URI scheme. Default to FALSE.
        }
        // Validate host name conforms to DNS "dot-separated-parts".
        if ($m['regname']) // If host regname specified, check for DNS conformance.
        {
            if (!preg_match('/# HTTP DNS host name.
                ^                      # Anchor to beginning of string.
                (?!.{256})             # Overall host length is less than 256 chars.
                (?:                    # Group dot separated host part alternatives.
                  [0-9A-Za-z]\.        # Either a single alphanum followed by dot
                |                      # or... part has more than one char (63 chars max).
                  [0-9A-Za-z]          # Part first char is alphanum (no dash).
                  [\-0-9A-Za-z]{0,61}  # Internal chars are alphanum plus dash.
                  [0-9A-Za-z]          # Part last char is alphanum (no dash).
                  \.                   # Each part followed by literal dot.
                )*                     # Zero or more parts before top level domain.
                (?:                    # Explicitly specify top level domains.
                  com|edu|gov|int|mil|net|org|biz|
                  info|name|pro|aero|coop|museum|
                  asia|cat|jobs|mobi|tel|travel|
                  [A-Za-z]{2})         # Country codes are exactly two alpha chars.
                $                      # Anchor to end of string.
                /ix', $m['host'])) return FALSE;
        }
        $m['url'] = $url;
        for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
        return $m; // return TRUE == array of useful named $matches plus the valid $url.
    }

The first regular expression validates the string as an absolute generic URI (one with a non-empty host part). The second regular expression validates the host part, when it is a registered name rather than an IP literal or IPv4 address, against the DNS naming rules: each dot-separated label is 63 characters or fewer and consists of digits, letters, and dashes, and the overall length is no more than 255 characters.
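For readers who do not use PHP, the host name rule can be sketched roughly like this in Python. This is a simplified illustration under my own assumptions: the function name `looks_like_dns_name` is hypothetical, and unlike the PHP function it does not check the top-level domain against a list, so a bare name such as `localhost` is accepted.

```python
import re

# One DNS label: 1 to 63 characters, alphanumeric at both ends,
# with dashes allowed only in the interior.
LABEL = re.compile(r"[0-9A-Za-z](?:[0-9A-Za-z-]{0,61}[0-9A-Za-z])?")


def looks_like_dns_name(host: str) -> bool:
    """Return True if host is made of valid dot-separated labels.

    Hypothetical helper: checks label syntax and overall length only,
    with no top-level-domain list and no DNS lookup.
    """
    if not host or len(host) > 255:
        return False
    return all(LABEL.fullmatch(label) for label in host.split("."))


for candidate in ["www.example.com", "localhost", "-bad.example.com"]:
    print(candidate, looks_like_dns_name(candidate))
```

Splitting on dots and checking each label separately keeps the pattern short; the length limit is a plain `len` check instead of a lookahead.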

Please note that the structure of this function allows easy expansion to include other schemes.


Appendix B of RFC 3986 / STD 66 (Uniform Resource Identifier (URI): Generic Syntax) provides the regular expression you need:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

  http://www.ics.uci.edu/pub/ietf/uri/#Related 

results in the following subexpression matches:

  $1 = http:
  $2 = http
  $3 = //www.ics.uci.edu
  $4 = www.ics.uci.edu
  $5 = /pub/ietf/uri/
  $6 = <undefined>
  $7 = <undefined>
  $8 = #Related
  $9 = Related

where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the five components as

  scheme    = $2
  authority = $4
  path      = $5
  query     = $7
  fragment  = $9
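As an illustration, the same decomposition can be reproduced with a few lines of Python. This is a sketch, not part of the RFC: the helper name `split_uri` is hypothetical, and Python reports a missing component as `None` where the RFC writes <undefined>.

```python
import re

# Appendix B pattern from RFC 3986.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri: str) -> dict:
    """Split a URI reference into the five components named in the RFC."""
    # The pattern can match the empty string, so match() never returns None.
    m = URI_REFERENCE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
for name, value in parts.items():
    print(f"{name:9} = {value}")
```

Note that the authority is returned as a single string; any user information or port (for example `host:8080`) still has to be split out of it separately.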

Going in the opposite direction, we can recreate a URI reference from its components by using the algorithm of Section 5.3.

As for validating a URI against a particular scheme, you will need to look at the RFC(s) describing the scheme(s) you are interested in to get the details needed to check that the URI is valid for the scheme it claims to use. The URI Scheme Registry is at http://www.iana.org/assignments/uri-schemes.html .
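A scheme-specific check layered on top of generic parsing might look like the following Python sketch. The rule table and the name `check_scheme_rules` are hypothetical; the single rule shown (no user information for HTTP) follows the PHP function in the other answer, and the real constraints for each scheme have to come from its RFC.

```python
from urllib.parse import urlsplit

# Hypothetical per-scheme rules, for illustration only.
SCHEME_RULES = {
    "http": {"allow_userinfo": False},
    "https": {"allow_userinfo": False},
    "ftp": {"allow_userinfo": True},
    "ftps": {"allow_userinfo": True},
}


def check_scheme_rules(url: str) -> bool:
    """Return True if url uses a known scheme and obeys its rules."""
    parts = urlsplit(url)  # urlsplit lower-cases the scheme
    rules = SCHEME_RULES.get(parts.scheme)
    if rules is None:
        return False  # unrecognised scheme
    if not parts.hostname:
        return False  # these schemes need a non-empty host
    if parts.username is not None and not rules["allow_userinfo"]:
        return False
    return True


print(check_scheme_rules("http://example.com:8080/path?x=1"))
print(check_scheme_rules("http://user:pw@example.com/"))
```

Adding another scheme means adding one entry to the table, plus whatever extra checks that scheme's RFC calls for.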

And even then you are doomed to some kind of failure. Consider the file: scheme. You cannot verify that it represents a valid path in the authority's file system (unless you are the authority). The best you can do is verify that it represents something that looks like a valid path. And even then, a Windows file: URL like file:///C:/foo/bar/baz/bat.txt is (and will be) invalid for anything other than a server running some flavor of Windows. Any server running some *nix would most likely choke on it (what is a drive letter, anyway?).


Is this possible in Perl?

Try:

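 use strict;
 use warnings;

 my $url = "http://localhost/test";

 # Negated character classes stop the scheme at the first ":" and the
 # host at the first "/", so a deeper path is not swallowed by the host.
 if ($url =~ m{^([^:]+)://([^/]+)/(.+)}) {
     my $protocol = $1;
     my $domain   = $2;
     my $dir      = $3;
     print "$protocol $domain $dir\n";
 }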

Source: https://habr.com/ru/post/892143/

