Remove duplicate domains from a list using regular expressions

I would like to use PCRE to take a list of URIs and filter it so that only one URI per domain remains.

Start

http://abcd.tld/products/widget1       
http://abcd.tld/products/widget2    
http://abcd.tld/products/review    
http://1234.tld/

Done

http://abcd.tld/products/widget1
http://1234.tld/

Any ideas, dear StackOverflow members?

5 answers

You can use simple tools like uniq.

See kobi's example from the comments:

grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
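
With the sample list above, this prints one line per host; note that it keeps only the scheme and host, not the original full URL:

http://1234.tld/
http://abcd.tld/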

While INSANELY inefficient, it can be done ...

(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)

Please do not use this


Don't use a regular expression to parse the URIs. A URI library can extract the host for you, and you can then keep one link per host.

Ruby:

require 'uri'

# links is an Array of URL strings, e.g.
# links = File.readlines('urls.txt', chomp: true)
unique_links = {}

links.each do |l|
  u = URI.parse(l)
  unique_links[u.host] = l # keyed by host, so a later link for the same host overwrites an earlier one
end

unique_links.values # returns an Array of the unique links
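
Because the hash value is reassigned on every iteration, this keeps the last link seen for each host. To keep the first one instead (as in the desired output above), a small tweak inside the loop does it:

  unique_links[u.host] ||= l # only store the first link seen for each host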

If you can treat the whole file as one string rather than working line by line, something like this should work. (I'm not sure about the character ranges.)

s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2!
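
A sketch of how this could be run (assuming Perl; -0777 slurps the whole file into one string, and a trailing \n plus the /g flag are added here so the remaining lines stay separated and every group of duplicates is collapsed, not just the first):

perl -0777 -pe 's!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2\n!g' urls.txt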

If you have (g)awk on your system:

awk -F"/" '{
  # rebuild everything up to (but not including) the last "/"-separated field
  s=$1
  for(i=2;i<NF;i++){ s=s"/"$i }
  # remember only the first final field seen for each such prefix
  if( !(s in a) ){ a[s]=$NF }
}
END{
  for(i in a) print i"/"a[i]
} ' file

Output

$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/
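
One caveat: for (i in a) walks the array in an unspecified order, so the lines may not come out in the same order as the input. A variant (a sketch under the same assumptions) that records the order in which each prefix is first seen:

awk -F"/" '{
  s=$1
  for(i=2;i<NF;i++){ s=s"/"$i }
  if( !(s in a) ){ a[s]=$NF; order[++n]=s }
}
END{
  for(j=1;j<=n;j++) print order[j]"/"a[order[j]]
} ' file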
