How do I remove duplicate domains from a large list of URLs? RegEx or otherwise

I originally asked this question: Regular expression in gVim to remove duplicate domains from a list

However, I realize I'm more likely to find a working solution if I broaden the scope of what kind of solution I'm willing to accept.

So, I'll rephrase my question, and maybe I'll get a better solution... here goes:

I have a large list of URLs in a TXT file (I'm on Windows Vista 32-bit), and I need to remove duplicate DOMAINS (along with the entire corresponding URL for each duplicate), leaving only the first occurrence of each domain. There are roughly 6,000,000 URLs in this particular file, in the following format (the URLs obviously don't have a space in them; I only had to write them that way because I don't have enough posts here to publish that many "live" URLs):

http://www.exampleurl.com/something.php
http://exampleurl.com/somethingelse.htm  
http://exampleurl2.com/another-url  
http://www.exampleurl2.com/a-url.htm  
http://exampleurl2.com/yet-another-url.html  
http://exampleurl.com/  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

And here is what I need to end up with:

http://www.exampleurl.com/something.php  
http://exampleurl2.com/another-url  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

This time around, any way of getting it done will do, regex or otherwise.

Keep in mind, though, that I'm not a programmer, so anything with multiple steps would need to be explained in some detail.

Also, since I'm on Windows, the solution should either use something built into Windows or a free program (not a trial or paid one).

+3
4 answers

Python would be my pick for something like this. It may not be the fastest option, but it's easy to read and tweak, which helps if you're new to all this. Something along these lines should do it:

import re

# groups: (http:// prefix)(optional "www")(dot)(domain name)(dot)(TLD)
pattern = re.compile(r'(http://?)(w*)(\.*)(\w*)(\.)(\w*)')
urlsFile = open("urlsin.txt", "r")
outFile = open("outurls.txt", "w")
urlsDict = {}

for linein in urlsFile.readlines():
    match = pattern.search(linein)
    if match is None:           # skip anything that doesn't look like a URL
        continue
    url = match.groups()
    domain = url[3]             # the bare domain name, without "www." or the TLD
    if domain not in urlsDict:  # keep only the first URL seen for each domain
        urlsDict[domain] = linein

outFile.write("".join(urlsDict.values()))

urlsFile.close()
outFile.close()

It won't handle every edge case (two-part TLDs, subdomains and so on), but for the format you showed it should be fine. And 6 million URLs is not much work for Python...
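If those edge cases do matter, here is a rough sketch of the same idea using the standard library's urlparse instead of a hand-rolled regex (Python 3 here; the file names are just placeholders). It also folds www.exampleurl.com and exampleurl.com into a single domain:

from urllib.parse import urlparse

seen = set()

# "urlsin.txt" / "outurls.txt" are placeholder file names
with open("urlsin.txt", "r") as urls_file, open("outurls.txt", "w") as out_file:
    for line in urls_file:
        host = urlparse(line.strip()).netloc.lower()
        if host.startswith("www."):       # treat www.example.com and example.com alike
            host = host[4:]
        if host and host not in seen:     # keep only the first URL for each host
            seen.add(host)
            out_file.write(line)

Letting urlparse pull out the host avoids guessing at the domain with a regex, at the cost of being a little slower.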

And remember the old line: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." (Jamie Zawinski, on comp.emacs.xemacs.)

+2

I wouldn't use a regex for this at all. URL parsing is already in the BCL: the Uri type. Use it to pull out the host, and a HashSet to track the hosts you've already seen.

public List<string> GetUrlWithUniqueDomain(string file) {
  var list = new List<string>();
  var found = new HashSet<string>();   // hosts already seen
  using ( var reader = new StreamReader(file) ) {
    var line = reader.ReadLine();
    while (line != null) {
      Uri uri;
      // keep the line only if it parses as an absolute URI and its host is new
      if ( Uri.TryCreate(line, UriKind.Absolute, out uri) && found.Add(uri.Host) ) {
        list.Add(line);
      }
      line = reader.ReadLine();
    }
  }
  return list;
}
+1

Perl and a regexp will do it. For example:

   use warnings ;
   use strict ;
   my %seen ;
   while (<>) {
       if ( m{ // ( .*? ) / }x ) {
           my $dom = $1 ;
           # print the line only the first time this domain is seen
           print unless $seen{$dom} ++ ;
           # print "$dom\n" ;   # uncomment to print just the domains instead
       } else {
           print "Unrecognised line: $_" ;
       }
   }

The downside is that www.exampleurl.com and exampleurl.com end up counted as different domains. Changing the match to

if ( m{ // (?:www\.)? ( .*? ) / }x )

"www". . , , , regexp, .

Finally, note the /x modifier on the regexp: it allows whitespace and comments inside the pattern, which makes it much easier to read once it grows, for example:

           if ( m{
               //          # match double slash
               (?:www\.)?  # ignore www
               (           # start capture
                  .*?      # anything but not greedy
                )          # end capture
                /          # match /
               }x ) {

Using m{ } as the delimiter instead of // also means the slashes inside the pattern don't need escaping as \/\/.

+1
  • Find a unix box if you don't have one, or get cygwin
  • use tr to convert '.' to TAB for convenience.
  • use sort(1) to sort the lines by the domain-name part. This can be made a little easier by first running a small awk program to normalize the 'www.' part.

Now the duplicates sit next to each other, and uniq(1) will take care of them (a rough Python rendering of the whole recipe is sketched below, if you'd rather not install anything).

(Extra credit: why can't a regular expression alone do this? Computer science students should think about the pumping lemma.)
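For anyone stuck on Windows without cygwin, a rough Python rendering of the same sort-then-uniq recipe might look like this (purely a sketch; the file names are placeholders, and the key function plays the role of the awk normalization step):

def domain_key(url):
    # grab the host part and strip a leading "www." (the awk normalization step)
    host = url.split("//", 1)[-1].split("/", 1)[0].lower()
    return host[4:] if host.startswith("www.") else host

with open("urlsin.txt") as f:                 # placeholder input file
    lines = [line for line in f if line.strip()]

lines.sort(key=domain_key)                    # the sort(1) step

with open("outurls.txt", "w") as out:         # placeholder output file
    previous = None
    for line in lines:                        # the uniq(1) step: drop adjacent duplicates
        key = domain_key(line)
        if key != previous:
            out.write(line)
        previous = key

Like the shell pipeline, this loses the original line order; if first-seen ordering matters, the dictionary approach in the Python answer above is the better fit.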

0

Source: https://habr.com/ru/post/1771033/

