Need a regular expression to capture the second-level domain (SLD)

I need a regex to capture the SLD of a given URL.

Examples:

 jack.bop.com -> bop
 bop.com -> bop
 bop.de -> bop
 bop.co.uk -> bop
 bop.com.br -> bop

All bop :). Therefore, this regular expression needs to ignore the ccTLD, the gTLD, and the ccSLD. The latter is the hard part, since I want to keep the regex as simple as possible.

The first step would be to strip the ccTLD or gTLD, then check for a ccSLD and strip that too, if present.

Any help is much appreciated :)

-

If this helps, ccTLDs are matched by:

 \.([a-z]{2})$ 

And gTLDs are matched by:

 \.([a-z]{3,6})$ 

Fortunately, these are two mutually exclusive patterns.
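
As a quick sanity check of the two patterns, here is a minimal Python sketch. It assumes the intended character class is `a-z` and the intended quantifier is `{3,6}`; the function name `classify_tld` is made up for illustration.

```python
import re

# Assumed readings of the two patterns from the question.
CC_TLD = re.compile(r"\.([a-z]{2})$")    # two-letter country-code TLD
G_TLD = re.compile(r"\.([a-z]{3,6})$")   # three- to six-letter generic TLD

def classify_tld(host):
    """Return (kind, tld) for the last label of host, or (None, None)."""
    host = host.lower()
    m = CC_TLD.search(host)
    if m:
        return ("ccTLD", m.group(1))
    m = G_TLD.search(host)
    if m:
        return ("gTLD", m.group(1))
    return (None, None)

for host in ("bop.de", "bop.com", "bop.co.uk"):
    print(host, classify_tld(host))
```

Note that `bop.co.uk` is classified only by its final label `uk`; the `co` part is exactly the ccSLD problem these two patterns do not solve on their own.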

1 answer

Technically, ".co.uk" is the second-level domain in "bop.co.uk". What you seem to be asking for is the highest-level part of the domain that is open for public registration. I don't know whether there is a good name for that. It is certainly not very clearly defined.

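To find the right thing, you will need to list all the suffixes that are not themselves open for public registration. You should probably order them from longest to shortest to handle cases like "www.british-library.uk". After that, the regex is pretty simple. Note that the optional subdomain prefix is lazy, so the engine tries the shortest prefix, and therefore the longest listed suffix, first; with a greedy prefix and "uk" in the list, "bop.co.uk" would capture "co" instead of "bop":

 (.+?\.)??([^.]+)\.(?:<suffixes>)$ 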

Here <suffixes> is your |-separated list of suffixes. A piece of it would look like this:

 gov\.uk|ac\.uk|co\.uk|com|org|net|us|uk 

Again, you want to order these longest first. More precisely, the real constraint is that any element that is a suffix of another element must appear later in the list; sorting longest first is an easy way to satisfy it.
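
To make the approach concrete, here is a minimal Python sketch that builds the pattern from a suffix list and extracts the registrable label. The suffix list is a small, hypothetical sample (a real one would be far longer), and the names `SUFFIXES`, `build_sld_regex`, and `extract_sld` are made up for illustration. It uses a lazy optional prefix, since with a greedy prefix a host like `bop.co.uk` would match as prefix `bop.`, label `co`, suffix `uk`.

```python
import re

# Hypothetical, deliberately incomplete sample of suffixes.
SUFFIXES = ["com", "org", "net", "de", "uk", "co.uk", "ac.uk",
            "gov.uk", "com.br", "br"]

def build_sld_regex(suffixes):
    """Compile a regex whose group 1 is the label just before the suffix."""
    # Longest first, following the ordering rule described above.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # (?:.+?\.)?? is a lazy optional subdomain prefix: the engine tries
    # the shortest prefix first, so the longest listed suffix wins.
    return re.compile(r"^(?:.+?\.)??([^.]+)\.(?:" + alternation + r")$")

SLD_RE = build_sld_regex(SUFFIXES)

def extract_sld(host):
    """Return the label before the public suffix, or None if no match."""
    m = SLD_RE.match(host.lower())
    return m.group(1) if m else None

for host in ("jack.bop.com", "bop.com", "bop.de", "bop.co.uk", "bop.com.br"):
    print(host, "->", extract_sld(host))
```

Each of the example hosts from the question yields `bop` with this sample list; a host whose suffix is missing from the list simply returns `None`.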

You can find a list of the suffixes you care about by researching how web browsers handle cookie domains. I seem to remember that browsers have some special cases to ensure that you cannot set a cookie that applies to all of co.uk.


Source: https://habr.com/ru/post/914768/

