How to check whether a URL is valid using `urlparse`?

I want to check if the URL is valid before opening it to read data.

I used the urlparse function from the urlparse module:

    if urlparse.urlparse(url).netloc:
        # do something like: open and read using urllib2

However, I noticed that some valid URLs are considered broken, for example:

 url = upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png 

This URL is valid (I can open it in my browser).

Is there a better way to check if the URL is valid?

3 answers

You can check if the URL has a scheme:

    >>> import urlparse
    >>> url = "no.scheme.com/math/12345.png"
    >>> parsed_url = urlparse.urlparse(url)
    >>> bool(parsed_url.scheme)
    False

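If the scheme is missing, you can set one with `_replace`. Note that the host name was parsed as part of the path, so `netloc` stays empty and the rebuilt URL has three slashes after `http:`: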

    >>> parsed_url.geturl()
    'no.scheme.com/math/12345.png'
    >>> parsed_url = parsed_url._replace(scheme="http")
    >>> parsed_url.geturl()
    'http:///no.scheme.com/math/12345.png'
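
To see where the host name ends up, it helps to look at the parsed fields directly. The following is a minimal sketch, assuming Python 3 (`urllib.parse` rather than the Python 2 `urlparse` module) and reusing the example URL from this answer. Without a scheme, the whole string is parsed as a path, so replacing only the scheme leaves `netloc` empty; prepending the scheme to the string before parsing is one way to get the host recognized.

```python
from urllib.parse import urlparse

url = "no.scheme.com/math/12345.png"

parsed = urlparse(url)
# Without a scheme or a leading "//", the whole string is treated as the path.
print(repr(parsed.scheme), repr(parsed.netloc), repr(parsed.path))

# Replacing only the scheme does not move the host out of the path.
replaced = parsed._replace(scheme="http")
print(repr(replaced.netloc))

# Prepending the scheme before parsing lets urlparse recognize the host.
fixed = urlparse("http://" + url)
print(repr(fixed.netloc), fixed.geturl())
```

The exact string returned by `geturl()` for the scheme-only replacement may differ between Python versions, so the sketch only inspects `netloc` for that case.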

You can try the function below, which checks the scheme, netloc and path attributes obtained from parsing the URL. It supports both Python 2 and 3.

    try:
        # Python 3
        from urllib.parse import urlparse
    except ImportError:
        # Python 2
        from urlparse import urlparse

    def url_validator(url):
        try:
            result = urlparse(url)
            # Valid only if scheme, netloc and path are all non-empty.
            return all([result.scheme, result.netloc, result.path])
        except Exception:
            return False
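
A short usage sketch, assuming Python 3 and repeating the function so the snippet runs on its own. The `example.com` URLs are illustrative and not from the question. Because `path` is also required, a bare `http://example.com` with no trailing slash is rejected, which may or may not be the behavior you want.

```python
from urllib.parse import urlparse

def url_validator(url):
    try:
        result = urlparse(url)
        # Valid only if scheme, netloc and path are all non-empty.
        return all([result.scheme, result.netloc, result.path])
    except Exception:
        return False

# The URL from the question has no scheme, so it is rejected.
no_scheme = url_validator(
    "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png")

# The same URL with an explicit scheme passes.
with_scheme = url_validator(
    "http://upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png")

# An empty path is rejected; a trailing slash is enough to pass.
bare_host = url_validator("http://example.com")
host_with_slash = url_validator("http://example.com/")

print(no_scheme, with_scheme, bare_host, host_with_slash)
```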

A URL without a scheme is actually invalid; your browser is just smart enough to assume http:// as the scheme for it. A good solution might be to check whether the URL has a scheme ( not re.match(r'^[a-zA-Z]+://', url) ) and prepend http:// if it does not.
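
A minimal sketch of that idea, assuming Python 3. The helper name `ensure_scheme` and its `default_scheme` parameter are hypothetical; the regular expression is the one given above. Keep in mind that this pattern only matches schemes made of letters, while RFC 3986 also allows digits, `+`, `-` and `.` after the first letter, so a URL such as `svn+ssh://...` would get a second scheme prepended.

```python
import re
from urllib.parse import urlparse

# Same pattern as in the answer: one or more letters followed by "://".
SCHEME_RE = re.compile(r'^[a-zA-Z]+://')

def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it starts with a scheme, else prepend one.

    Hypothetical helper for illustration only.
    """
    if not SCHEME_RE.match(url):
        url = default_scheme + "://" + url
    return url

# The scheme-less URL from the question now parses with a proper host.
url = ensure_scheme(
    "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png")
parsed = urlparse(url)
print(parsed.scheme, parsed.netloc)
```

After this step, the `netloc` check from the question behaves as expected for URLs that a browser would accept without a scheme.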


Source: https://habr.com/ru/post/973699/

