How to check whether a URL is valid in Python? (Malformed or not)

I get a URL from the user and I have to respond with the fetched HTML.

How can I check whether the URL is malformed?

Example:

    url = 'google'             # Malformed
    url = 'google.com'         # Malformed
    url = 'http://google.com'  # Valid
    url = 'http://google'      # Malformed

How can we achieve this?

+69
python url malformedurlexception
8 answers

A Django-style URL check with a regex:

    import re

    regex = re.compile(
        r'^(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

    print(re.match(regex, "http://www.example.com") is not None)  # True
    print(re.match(regex, "example.com") is not None)             # False
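Applying the same pattern to the four examples from the question — a quick sketch, not Django's current validator:

```python
import re

# Django-style URL regex, reproduced here so the snippet is self-contained
URL_RE = re.compile(
    r'^(?:http|ftp)s?://'                              # scheme
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+'  # domain labels...
    r'(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'             # ...and TLD
    r'localhost|'                                      # localhost
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'             # or an IPv4 address
    r'(?::\d+)?'                                       # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

for url in ('google', 'google.com', 'http://google.com', 'http://google'):
    print(url, bool(URL_RE.match(url)))
```

This matches the expectations in the question: only `http://google.com` passes.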
+57

Actually, I think this is the best way.

    from django.core.validators import URLValidator
    from django.core.exceptions import ValidationError

    val = URLValidator(verify_exists=False)
    try:
        val('http://www.google.com')
    except ValidationError as e:
        print(e)

If you set verify_exists to True, it will actually check whether the URL exists; otherwise it only checks that the URL is correctly formed. Note that verify_exists was removed in later versions of Django, where URLValidator() only validates the format.

edit: ah yeah, this question is a duplicate of this one: How to check if a URL exists using Django's validators?

+103
Aug 23 '11 at 12:10

Use the validators package:

    >>> import validators
    >>> validators.url("http://google.com")
    True
    >>> validators.url("http://google")
    ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
    >>> if not validators.url("http://google"):
    ...     print("not valid")
    ...
    not valid

Install it using pip ( pip install validators ).

+81
Aug 23 '15 at 21:46

A True/False version based on @DMfll's answer:

    try:  # Python 2
        from urlparse import urlparse
    except ImportError:  # Python 3
        from urllib.parse import urlparse

    a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
    b = '/data/Python.html'
    c = 532
    d = u'dkakasdkjdjakdjadjfalskdjfalk'

    def uri_validator(x):
        try:
            result = urlparse(x)
            return all([result.scheme, result.netloc, result.path])
        except (AttributeError, ValueError):  # e.g. urlparse(532) is not a string
            return False

    print(uri_validator(a))
    print(uri_validator(b))
    print(uri_validator(c))
    print(uri_validator(d))

gives:

    True
    False
    False
    False
+30
Jun 24 '16 at 18:37

Note: lepl is no longer maintained, sorry (you can still use it, and I think the code below works, but it will not receive updates).

RFC 3696 (http://www.faqs.org/rfcs/rfc3696.html) defines how to do this (for HTTP addresses and email). I implemented its recommendations in Python using lepl (a parser library). See http://acooke.org/lepl/rfc3696.html

Usage:

    > easy_install lepl
    ...
    > python
    ...
    >>> from lepl.apps.rfc3696 import HttpUrl
    >>> validator = HttpUrl()
    >>> validator('google')
    False
    >>> validator('http://google')
    False
    >>> validator('http://google.com')
    True
+8

I got to this page while trying to find a reasonable way to validate strings as "valid" URLs. I will share my solution here using python3. No additional libraries are required.

See https://docs.python.org/2/library/urlparse.html if you are using python2.

See https://docs.python.org/3.0/library/urllib.parse.html if you are using python3, just like me.

    import urllib.parse
    from pprint import pprint

    invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
    valid_url = 'http://qaru.site/'
    tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

    for token in tokens:
        pprint(token)

    min_attributes = ('scheme', 'netloc')  # add attrs to your liking
    for token in tokens:
        if not all([getattr(token, attr) for attr in min_attributes]):
            error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
            print(error)
        else:
            print("'{url}' is probably a valid url.".format(url=token.geturl()))

This prints:

    ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')
    ParseResult(scheme='http', netloc='qaru.site', path='/', params='', query='', fragment='')
    'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.
    'http://qaru.site/' is probably a valid url.

Here is a more concise function:

    import urllib.parse

    min_attributes = ('scheme', 'netloc')

    def is_valid(url, qualifying=None):
        qualifying = min_attributes if qualifying is None else qualifying
        token = urllib.parse.urlparse(url)
        return all([getattr(token, qualifying_attr)
                    for qualifying_attr in qualifying])
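A quick usage sketch (the helper is re-defined here so the snippet is self-contained; the URLs are just the examples used above):

```python
from urllib.parse import urlparse

def is_valid(url, qualifying=('scheme', 'netloc')):
    # A URL qualifies when every requested component is non-empty.
    token = urlparse(url)
    return all(getattr(token, attr) for attr in qualifying)

print(is_valid('http://qaru.site/'))              # True
print(is_valid('dkakasdkjdjakdjadjfalskdjfalk'))  # False
# Require a path as well by passing extra attributes:
print(is_valid('http://qaru.site/', ('scheme', 'netloc', 'path')))  # True: path is '/'
```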
+7
Mar 29 '16 at 11:52

I am currently using the following based on Padam's answer:

    $ python --version
    Python 3.6.5

And here is what it looks like:

    from urllib.parse import urlparse

    def is_url(url):
        try:
            result = urlparse(url)
            return all([result.scheme, result.netloc])
        except ValueError:
            return False

Just use is_url("http://www.asdf.com").

Hope it helps!
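One caveat worth noting (a sketch using the same scheme/netloc check as above): this validates only the URL's shape, so a host without a TLD still passes:

```python
from urllib.parse import urlparse

def is_url(url):
    # Valid only if both a scheme and a network location are present.
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

print(is_url('http://www.asdf.com'))  # True
print(is_url('www.asdf.com'))         # False: no scheme
print(is_url('http://google'))        # True: shape is fine, the TLD is not checked
```

If you also need a TLD check, combine this with the validators package or a DNS lookup, as other answers suggest.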

+4
Sep 22 '18 at 10:55

EDIT

As @Kwame points out, the code below validates the URL even if the .com or .co etc. part is missing.

@Blaise also pointed out that URLs such as https://www.google pass this check, and you need to do a DNS lookup separately to see whether the domain actually resolves.

It is simple and works:

min_attr contains the basic set of attributes that must be present for the URL to be considered valid, i.e. the http:// part and the google.com part.

result.scheme stores the http part and

result.netloc stores the domain name google.com.

    from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

    def url_check(url):
        min_attr = ('scheme', 'netloc')
        try:
            result = urlparse(url)
            if all([result.scheme, result.netloc]):
                return True
            else:
                return False
        except ValueError:
            return False

all() returns True if every element inside it is truthy. So if result.scheme and result.netloc are both present, i.e. have some value, the URL is valid and the function returns True.
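A small illustration of that truthiness check (the URLs are just examples):

```python
from urllib.parse import urlparse

result = urlparse('http://google.com')
print(result.scheme)                        # http
print(result.netloc)                        # google.com
print(all([result.scheme, result.netloc]))  # True

result = urlparse('google')                 # parsed entirely as a path
print(all([result.scheme, result.netloc]))  # False: empty strings are falsy
```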

+2
Jul 12 '17 at 6:58


