Remove duplicate domains from a list using regular expressions

I would like to use PCRE to take a list of URIs and filter it so that only one URI per domain remains.

Start

http://abcd.tld/products/widget1       
http://abcd.tld/products/widget2    
http://abcd.tld/products/review    
http://1234.tld/

Done

http://abcd.tld/products/widget1
http://1234.tld/

Any ideas, dear StackOverflow members?

5 answers

You can use simple tools like uniq.

See kobi's example from the comments:

grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
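
With the sample list above, this prints one line per host; note that it keeps only the scheme and host, not the original full URL:

http://1234.tld/
http://abcd.tld/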

While INSANELY inefficient, it can be done ...

(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)

Please do not use this


Don't use a regular expression to parse the URIs. A URI library can extract the host for you, and you can then keep one link per host.

Ruby:

require 'uri'

# links is an Array of URL strings, e.g.
# links = File.readlines('urls.txt', chomp: true)
unique_links = {}

links.each do |l|
  u = URI.parse(l)
  unique_links[u.host] = l # keyed by host, so a later link for the same host overwrites an earlier one
end

unique_links.values # returns an Array of the unique links
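
Because the hash value is reassigned on every iteration, this keeps the last link seen for each host. To keep the first one instead (as in the desired output above), a small tweak inside the loop does it:

  unique_links[u.host] ||= l # only store the first link seen for each host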

If you can treat the whole file as one string rather than working line by line, something like this should work. (I'm not sure about the character ranges.)

s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2!
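
A sketch of how this could be run (assuming Perl; -0777 slurps the whole file into one string, and a trailing \n plus the /g flag are added here so the remaining lines stay separated and every group of duplicates is collapsed, not just the first):

perl -0777 -pe 's!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2\n!g' urls.txt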

If you have (g)awk on your system:

awk -F"/" '{
  # rebuild everything up to (but not including) the last "/"-separated field
  s=$1
  for(i=2;i<NF;i++){ s=s"/"$i }
  # remember only the first final field seen for each such prefix
  if( !(s in a) ){ a[s]=$NF }
}
END{
  for(i in a) print i"/"a[i]
} ' file

Output

$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/
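
One caveat: for (i in a) walks the array in an unspecified order, so the lines may not come out in the same order as the input. A variant (a sketch under the same assumptions) that records the order in which each prefix is first seen:

awk -F"/" '{
  s=$1
  for(i=2;i<NF;i++){ s=s"/"$i }
  if( !(s in a) ){ a[s]=$NF; order[++n]=s }
}
END{
  for(j=1;j<=n;j++) print order[j]"/"a[order[j]]
} ' file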
