Using awk to find the domain name containing the longest repeated word

For example, let's say there is a file named domains.csv with the following contents:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org

I am trying to use Linux awk regular expressions to find the line containing the longest repeated word [1], so in this case it should return the line

5,letswelcomewelcomeyou.org

How can I do that?

[1] "Repeated" here means repeated immediately, i.e. abcabc but not abcXabc.

+4
2 answers

A pure awk implementation would be rather long-winded, since awk regular expressions do not support backreferences, which would simplify the approach quite a bit.

With this slightly extended input file:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org

the following pipeline works:

cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile

Broken down step by step, this does the following:

  • Keep only the second field, the domain name:

    $ cut -d ',' -f 2 infile
    helloguys.ca
    byegirls.com
    hellohelloboys.ca
    hellobyebyedad.com
    letswelcomewelcomeyou.org
    letscomewelcomewelyou.org
    
  • Extract the immediately repeated strings, printing only the matching part (-o):

    ... | grep -Eo '(.*)\1'
    ll
    hellohello
    ll
    byebye
    welcomewelcome
    comewelcomewel
    
  • Prepend the length of each match:

    ... | awk '{ print length(), $0 }'
    2 ll
    10 hellohello
    2 ll
    6 byebye
    14 welcomewelcome
    14 comewelcomewel
    
  • Sort numerically by length, longest first:

    ... | sort -k 1,1 -nr
    14 welcomewelcome
    14 comewelcomewel
    10 hellohello
    6 byebye
    2 ll
    2 ll
    
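  • Print the second field of the first line and of every following line with the same length (ties), then exit: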

    ... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
    welcomewelcome
    comewelcomewel
    
  • Pass the result to grep as a list of patterns; -f - reads the patterns from stdin:

    ... | grep -f - infile
    5,letswelcomewelcomeyou.org
    6,letscomewelcomewelyou.org
    

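This works for a line such as bbwelcomewelcome, where both bb and welcomewelcome are found, but there is a limitation: grep -o reports only non-overlapping matches, so for welwelcomewelcome it finds welwel and never welcomewelcome.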
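
A line such as welwelcomewelcome, where the longest repeat overlaps a shorter one, can be handled by testing every start position explicitly, which also gives a pure awk version without backreferences. The following is only a brute-force sketch, not the method used in this answer; the scratch file, the sample lines and the variable names (line_best, best, lines) are made up for illustration.

```shell
# Hypothetical sample data in a scratch file.
tmp=$(mktemp)
printf '%s\n' '1,hellohelloboys.ca' '2,welwelcomewelcome.org' '3,byebye.com' > "$tmp"

result=$(awk -F ',' '
{
    s = $2; n = length(s); line_best = 0
    # For every start position i, look for the longest unit of length len
    # that is immediately followed by a copy of itself.
    for (i = 1; i <= n; i++)
        for (len = int((n - i + 1) / 2); len > line_best; len--)
            if (substr(s, i, len) == substr(s, i + len, len)) {
                line_best = len
                break
            }
    # A new record discards the stored lines; ties are kept.
    if (line_best > best) { best = line_best; count = 0 }
    if (line_best > 0 && line_best == best) lines[++count] = $0
}
END { for (k = 1; k <= count; k++) print lines[k] }
' "$tmp")

rm -f "$tmp"
printf '%s\n' "$result"
```

The nested loops make this cubic in the length of each domain name in the worst case, which should be acceptable for short strings but is a trade-off against the regex-based pipeline.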

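Only awk, no sort

As tripleee pointed out, the awk, sort and awk steps can be replaced by a single awk that keeps track of the longest matches itself, so nothing has to be sorted: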

$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile

The awk script, spread out for readability:

{
    # New longest match: throw away stored longest matches, reset index
    if (length() > max_len) {
        max_len = length()
        delete arr_longest
        idx = 1
    }

    # Add line to longest matches
    if (length() >= max_len)
        arr_longest[idx++] = $0
}

# Print all the longest matches
END {
    for (idx in arr_longest)
        print arr_longest[idx]
}

Timed on a large test file, the solutions compare as follows:

  • First solution (sort and two awk calls):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.742s
    user    1m57.873s
    sys     0m0.045s
    
  • Second solution (single awk, no sort):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.603s
    user    1m56.514s
    sys     0m0.045s
    
  • Perl solution from the other answer:

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    0m5.249s
    user    0m5.234s
    sys     0m0.000s
    

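Conclusion: use Perl ;)

If only a single longest match is needed, things can be simplified (head -1 instead of the second awk in the first solution, or an awk that does not keep track of several longest matches in the second).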
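
As a sketch of the head -1 simplification, assuming only one longest match is wanted (if several share the top length, all but one are dropped) and GNU grep is available; the scratch file and the variable names are illustrative:

```shell
# Scratch copy of the sample data from the question.
infile=$(mktemp)
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' > "$infile"

# Same pipeline as the first solution, but head keeps only the top line
# after sorting, and a second cut strips the length prefix again.
longest=$(cut -d ',' -f 2 "$infile" | grep -Eo '(.*)\1' |
    awk '{ print length(), $0 }' | sort -k 1,1 -nr |
    head -n 1 | cut -d ' ' -f 2- | grep -f - "$infile")

rm -f "$infile"
printf '%s\n' "$longest"
```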

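As a side note, BSD grep does not support grep -f - to read the patterns from stdin. In that case, the output of the pipeline up to that point has to be written to a temporary file, which is then passed to grep -f.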
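
The temporary-file variant could look roughly like this. It is a sketch that changes only the -f part of the second solution; the file handling with mktemp and the variable names are assumptions for illustration.

```shell
# Scratch copies: the six-line sample input and a file for the patterns.
infile=$(mktemp)
patterns=$(mktemp)
printf '%s\n' '1,helloguys.ca' '2,byegirls.com' '3,hellohelloboys.ca' \
    '4,hellobyebyedad.com' '5,letswelcomewelcomeyou.org' \
    '6,letscomewelcomewelyou.org' > "$infile"

# Collect the longest repeated strings in the pattern file instead of
# piping them straight into grep.
cut -d ',' -f 2 "$infile" | grep -Eo '(.*)\1' |
    awk '{ if (length() > ml) { ml = length(); delete a; i = 1 }
           if (length() >= ml) a[i++] = $0 }
         END { for (i in a) print a[i] }' > "$patterns"

# grep reads the patterns from a regular file, so "-f -" is not needed.
matches=$(grep -f "$patterns" "$infile")

rm -f "$infile" "$patterns"
printf '%s\n' "$matches"
```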

+6

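With perl:

perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
    # longest repeated substring first
    @m=sort { length $b <=> length $a} @m;
    $cl=length $m[0];
    # longer than anything so far: start over; same length: keep both
    if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file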

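The idea is to find, for each position in the second field, the longest repeated substring starting there, including overlapping ones. The array of matches is then sorted by length, so that the longest substring becomes its first item ($m[0]).

The length of this substring ($cl) is then compared with the stored length (that of the longest substring seen so far). If the current substring is longer, the result array is overwritten with the current line; if the lengths are equal, the current line is pushed onto the result array.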

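Details:

Command line options:

-F, sets the field separator to ,
-ane (e runs the code that follows, n reads the input line by line and puts the current line in $_, a autosplits each line on the field separator into @F),

The pattern: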

/
(?=         # open a lookahead assertion
    (.+)\1  # capture group 1 and backreference to the group 1
)           # close the lookahead
/g # all occurrences 

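In a normal global search the matches cannot overlap, because each new attempt starts where the previous match ended. The trick here is to put the whole pattern inside a lookahead (a lookahead only checks what follows and consumes no characters). The match itself is therefore empty, so the next attempt starts just one position further and overlapping substrings are found. Since the lookahead matches nothing by itself, the capture is needed to get at the substring (as capture group 1).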

+6

Source: https://habr.com/ru/post/1628910/
