Match domain name from URL (www.google.com = google)

So, I want to match only the domain name from any of these:

http://www.google.com/test/
http://google.com/test/
http://google.net/test/

The output for all three should be: google

This code works for me, but only for .com:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'

Then I thought it would be as easy as writing (com|net), but that does not seem to be the case:

echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)

I was going to use a similar method to get rid of "www", but it seems that I'm doing something wrong... (does alternation not work in a regex outside of \( \)?)

+3
5 answers

This will output google in all cases:

sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"

Edit:

For URLs such as "http://google.com.cn/test" and "http://www.google.co.uk/", this works:

sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"

And, if the "http://" prefix is optional:

sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"
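
For comparison, the last pattern can be sketched with Python's re module. The function name extract_name is made up for this illustration, and the pattern is a direct port of the sed expression above rather than a general-purpose URL parser.

```python
import re

# Port of the sed pattern: optional "http://", optional "www.",
# then everything up to the first dot is taken as the name.
PATTERN = re.compile(r"^(?:http://)?(?:www\.)?([^.]*)\.")

def extract_name(url):
    """Return the first label after an optional scheme and "www.", or None."""
    match = PATTERN.match(url)
    return match.group(1) if match else None

for url in ["http://www.google.com/test/",
            "http://google.com.cn/test",
            "www.google.co.uk/"]:
    print(extract_name(url))
```

Like the sed version, this only looks at the first label, so it assumes the name of interest comes directly after the optional "www.".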
+1

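In Python, using urlparse:

from urllib.parse import urlparse  # Python 3; in Python 2 this was the urlparse module

with open("file") as urls:
    for line in urls:
        # strip the trailing newline before parsing
        netloc = urlparse(line.strip()).netloc
        labels = netloc.split(".")
        # skip a leading "www" label if present
        if labels[0] == "www":
            print(labels[1])
        else:
            print(labels[0])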

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/

$ ./python.py
google
google
google

awk

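awk '{
    # strip the leading "http://" and everything from the first "/" after the host
    gsub(/http:\/\/|\/.*$/,"")
    # split the remaining host name into dot-separated labels
    split($0,d,".")
    if(d[1]=="www"){
        print d[2]
    }else{
        print d[1]
    }
} ' file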

$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test

$ ./shell.sh
google
google
google
google
google
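
The last two lines of that file have no "http://" prefix. A urlparse-based approach needs a small adjustment for such input, because urlparse only recognises a host name when it is introduced by "//". A minimal sketch, assuming a hypothetical helper named site_name:

```python
from urllib.parse import urlparse

def site_name(url):
    """Return the first host label, skipping a leading "www"."""
    # urlparse only fills in netloc when the host is introduced by "//",
    # so add that prefix for inputs that lack a scheme.
    if "://" not in url:
        url = "//" + url
    labels = urlparse(url).netloc.split(".")
    # Skip a leading "www" label, mirroring the awk logic above.
    if labels[0] == "www" and len(labels) > 1:
        return labels[1]
    return labels[0]

for url in ["http://www.google.com/test/",
            "www.google.com.cn/test",
            "google.com/test"]:
    print(site_name(url))
```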
+1
s|http://(www\.)?([^.]*).*|$2|

This is Perl syntax, but it can be adapted to sed with a little extra escaping.

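0

Have you tried the "-r" option of sed? (It enables extended regular expressions, as in egrep.)

Edit: try this; it seems to work. sed's extended regular expressions have no "?:" non-capturing syntax, so com|net goes in an ordinary group that the replacement simply never references.

 echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(com|net).*$/\1/p"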
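
Where non-capturing groups are available, for example in Python's re module, "(?:com|net)" groups the alternation without adding a capture. A minimal sketch follows; the function name name_before_tld and the optional "www." handling are assumptions added for illustration, not part of the answer above.

```python
import re

# (?:...) groups without capturing, so group(1) is the only capture.
# The lazy (.*?) stops at the first ".com" or ".net" that is followed
# by "/" or the end of the string.
PATTERN = re.compile(r"^(?:http://)?(?:www\.)?(.*?)\.(?:com|net)(?:/|$)")

def name_before_tld(url):
    """Return the text before .com/.net, or None if the URL does not match."""
    match = PATTERN.match(url)
    return match.group(1) if match else None

print(name_before_tld("http://www.google.com/test/"))
print(name_before_tld("http://google.net/test/"))
```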
0
#!/bin/bash

urls=(
  http://www.google.com/test/
  http://google.com/test/
  http://google.net/test/
)

# quote the expansions so each URL is passed through unchanged
for url in "${urls[@]}"; do
  echo "$url" | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done
0

Source: https://habr.com/ru/post/1731817/

