What is the difference between the getHost and getAuthority methods of the URL class in Java?

I have a number of lines (URLs) in different forms:

  • http://domain name.anything/anypath
  • https://domain name.anything/anypath
  • http://www.domain name.anything/anypath
  • https://www.domain name.anything/anypath

These lines are saved in a CSV file. I need to parse each URL to get only the domain name (domain name.anything), i.e. the part after the first . and before the first /.

I split the lines using the split method, converted each one to a URL, and then called getAuthority to get only the domain name. The problem is that getAuthority and getHost do the same job for me: both include the www. that I do not want, although the Oracle tutorial seems to suggest that getAuthority should return the domain name without www.

How can I extract only the domain name, without www., from the URLs?

+6
3 answers

What is the difference between the getHost and getAuthority methods of the URL class?

To understand this, you should read the URI specification, RFC 2396.

The short answer is that the authority component consists of the host component together with an optional port number, username and password, depending on the URL scheme used.
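To illustrate the difference, here is a minimal sketch that prints both values for two URLs. The example URLs, the class name HostVsAuthority and the helper parse are made up for illustration; only getHost and getAuthority come from the question.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class HostVsAuthority {
    // Hypothetical helper: builds a java.net.URL and rethrows the checked
    // exception as an unchecked one to keep the example short.
    static URL parse(String text) {
        try {
            return URI.create(text).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + text, e);
        }
    }

    public static void main(String[] args) {
        URL plain = parse("http://www.example.com/anypath");
        URL full = parse("http://user:secret@www.example.com:8080/anypath");

        // No user info and no port: host and authority are the same string.
        System.out.println(plain.getHost());      // www.example.com
        System.out.println(plain.getAuthority()); // www.example.com

        // User info and port present: only the authority carries them.
        System.out.println(full.getHost());       // www.example.com
        System.out.println(full.getAuthority());  // user:secret@www.example.com:8080
    }
}
```

For URLs like the ones in the question, which have neither a port nor user info, the two methods return the same string, which would explain why they appear to do the same job.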


How can I extract only the domain name, without "www."?

Call getHost(), check whether the result starts with the string "www.", and if it does, remove that prefix.

But before you start doing this, you need to understand that removing "www." can give you a URL that does not work, or one that resolves to a different document or service than the original URL. It is a bad idea to strip parts of URLs arbitrarily ... unless you have detailed knowledge of how the sites are organized.

The convention that "foo.com" and "www.foo.com" are the same place is just a convention, and many sites do not implement it. Removing "www." can therefore turn URLs that resolve into URLs that do not.
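The step just described might look like the following sketch. The class name WwwStripper and the method hostWithoutWww are hypothetical, and the case-insensitive comparison is an added assumption based on host names not being case-sensitive.

```java
import java.net.MalformedURLException;
import java.net.URI;

final class WwwStripper {
    private static final String PREFIX = "www.";

    // Returns the host of the given URL with a leading "www." removed, if present.
    static String hostWithoutWww(String urlText) {
        String host;
        try {
            host = URI.create(urlText).toURL().getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + urlText, e);
        }
        // regionMatches with ignoreCase=true also catches "WWW." and "Www.".
        if (host.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return host.substring(PREFIX.length());
        }
        return host;
    }

    public static void main(String[] args) {
        System.out.println(hostWithoutWww("https://www.example.org/anypath")); // example.org
        System.out.println(hostWithoutWww("http://example.org/anypath"));      // example.org
    }
}
```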

+13

It looks like you want to extract the effective second-level domain. For a small set of public suffixes such as .com, .net and .org this is easy: get the host name, as Stephen describes, and take the substring that starts after the second period from the end. Many public suffixes, such as co.uk, break this simple algorithm. The complete list of public suffixes can be found at http://publicsuffix.org/. You can then use the public suffixes as a lookup table to find the effective second-level domain.
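A rough sketch of the lookup-table idea follows. The five suffixes in SUFFIXES are a tiny hand-picked subset chosen for illustration, not the real list, and the names RegistrableDomain and effectiveSecondLevel are hypothetical. The real list also contains wildcard and exception rules that this sketch ignores.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

final class RegistrableDomain {
    // Assumption: illustrative subset only; load the full list from
    // publicsuffix.org for real use.
    static final Set<String> SUFFIXES = Set.of("com", "net", "org", "uk", "co.uk");

    // Returns the label immediately left of the longest known suffix, plus that suffix.
    static String effectiveSecondLevel(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Tails are tried longest first, so "co.uk" is found before "uk".
        for (int i = 0; i < labels.length; i++) {
            String tail = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(tail)) {
                // i == 0 means the host itself is a bare suffix; nothing to extract.
                return i == 0 ? host : labels[i - 1] + "." + tail;
            }
        }
        return host; // no known suffix: leave the host unchanged
    }

    public static void main(String[] args) {
        System.out.println(effectiveSecondLevel("www.example.com"));   // example.com
        System.out.println(effectiveSecondLevel("www.example.co.uk")); // example.co.uk
    }
}
```

With only "com", "net" and "org" in the table this reduces to the second-period-from-the-end rule; the longest-match loop is what makes multi-label suffixes such as co.uk work.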

+1

You can use Google Guava to get the domain name from the host name:

 InternetDomainName.from(hostname).topPrivateDomain().toString() 
+1

Source: https://habr.com/ru/post/919043/