Get domain from URL to GROUP BY using MySQL

I have a table filled with urls. URLs are in all formats: http://foo.com , http://bar.foo.com , http://foo.com/bar , etc. But I'm only interested in the domain name, so in this case: foo.com. I would like to indicate how many domain names exist in this table. So something like:

SELECT "whatever the domain is in field 'url'", COUNT(*) AS count FROM table_with_urls GROUP BY "whatever the domain is in field 'url'" 

There are several similar questions about the stack overflow, but did not answer anything. I can’t use LIKE or match something with REGEXP because I don’t (always) look for specific domain names to match, but basically I just want all the domain names from the table along with the total.

Is this possible using MySQL?

+4
source share
3 answers

Add another indexed column for the "domain", and when you do INSERT, save this value separately.

+5
source

I had the same problem and this is what I did:

 select SUBSTRING(url from 1 for locate('/',url ,10)-1),count(*) from url_list group by SUBSTRING(url from 1 for locate('/',url ,10)-1); 
+8
source

If you want to install the MySQL extension, then https://github.com/StirlingMarketingGroup/mysql-get-etld-p1

It extracts basically what you expect

 select'get_etld_p1'('http://a.very.complex-domain.co.uk:8080/foo/bar');-- 'complex-domain.co.uk' select'get_etld_p1'('https://www.bbc.co.uk/');-- 'bbc.co.uk' select'get_etld_p1'('https://github.com/StirlingMarketingGroup/');-- 'github.com' select'get_etld_p1'('https://localhost:10000/index');-- 'localhost' select'get_etld_p1'('android-app://com.google.android.gm');-- 'com.google.android.gm' select'get_etld_p1'('example.test.domain.com');-- 'domain.com' select'get_etld_p1'('postgres://user: pass@host.com :5432/path?k=v#f');-- 'host.com' select'get_etld_p1'('exzvk.omsk.so-ups.ru');-- 'so-ups.ru' select'get_etld_p1'('http://10.64.3.5/data_check/index.php?r=index/rawdatacheck');-- '10.64.3.5' select'get_etld_p1'('not a domain');-- null 

Then, if you want this to be fast, you can create a second denormalized column that only stores these values, something like

 CREATE TABLE 'db'.'sometablewithurls' ( 'SomeTableWithURLsID' INT UNSIGNED NOT NULL AUTO_INCREMENT, 'URL' TEXT NOT NULL DEFAULT '', '_ETLDP1' VARCHAR(255) NOT NULL DEFAULT '', PRIMARY KEY ('SomeTableWithURLsID'), INDEX '_ETLDP1' ('_ETLDP1' ASC)); DROP TRIGGER IF EXISTS 'db'.'sometablewithurls_BEFORE_INSERT'; DELIMITER $$ USE 'db'$$ CREATE DEFINER = CURRENT_USER TRIGGER 'db'.'sometablewithurls_BEFORE_INSERT' BEFORE INSERT ON 'sometablewithurls' FOR EACH ROW BEGIN set new.'_ETLDP1'=ifnull('get_etld_p1'(new.'URL'),''); END$$ DELIMITER ; DROP TRIGGER IF EXISTS 'db'.'sometablewithurls_BEFORE_UPDATE'; DELIMITER $$ USE 'db'$$ CREATE DEFINER = CURRENT_USER TRIGGER 'db'.'sometablewithurls_BEFORE_UPDATE' BEFORE UPDATE ON 'sometablewithurls' FOR EACH ROW BEGIN set new.'_ETLDP1'=ifnull('get_etld_p1'(new.'URL'),''); END$$ DELIMITER ; 

Pay attention to the _ETLDP1 index (stands for Extended Top Level Domain Plus 1) and the trigger updates it both during insertion and when updating to be aware, even if the URL changes.

0
source

Source: https://habr.com/ru/post/1299474/


All Articles