Using awk to search for a domain name containing the longest repeating word

Question

For example, say there is a file named domains.csv with the following contents:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org

I am trying to use awk regular expressions on Linux to find the line containing the longest repeated word¹, so in this case it should return the line

5,letswelcomewelcomeyou.org

How can I do this?

¹ "Repeated" here means repeated immediately, i.e. abcabc but not abcXabc.

+4

bash regex awk

cullan Feb 15 '16 at 23:25

2 answers

perl:

perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
    @m=sort { length $b <=> length $a} @m;
    $cl=length @m[0];
    if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file

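For each line, the idea is to find the longest repeated substring at every position of the second field, then sort the matches by length so that the longest one comes first (@m[0]).

Its length ($cl) is then compared with the current maximum length ($l). If it is greater, the result list is replaced by the current line; if it is equal, the current line is appended to the list.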

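Details:

-F, sets the field separator to a comma,
-ane (e executes the code that follows, n reads the input line by line into $_, a enables autosplit, which splits each line on the field separator into the array @F).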

The pattern:

/
(?=         # open a lookahead assertion
    (.+)\1  # capture group 1 and backreference to the group 1
)           # close the lookahead
/g # all occurrences

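This is a common way to find overlapping matches in a string. It relies on the fact that a lookahead does not consume characters (a lookahead only means "check whether this subpattern follows at the current position", it matches nothing itself). To get at the characters seen by the lookahead, all that is needed is a capture group. Because the lookahead matches nothing, the pattern is tried at every position, regardless of whether those characters were already captured in group 1 at an earlier position.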

+6

Casimir et Hippolyte Feb 16 '16 at 0:50

Benjamin W. · Accepted Answer · 2016-02-15T23:58:33+0000

A pure awk implementation would be rather long-winded, because awk regular expressions do not support backreferences, and using them simplifies the approach quite a bit.

I added one line to the example input to cover the case of several equally long repeats:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org

This pipeline prints the lines containing the longest repeated sequence:

cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile

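Since this is not exactly an awk-only solution, here is what each step does.

First, cut extracts the second column, the domain names: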

$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org

Next, grep -Eo prints every repeated sequence, one per line:

... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel

The first awk prefixes each match with its length:

... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel

Then sort orders the lines by the first column, numerically and in descending order:

... | sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll

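The second awk prints the match (the second column) of every line that shares the greatest length, then exits: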

... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel

Finally, the result is piped into grep, where -f - makes it read the patterns from stdin:

... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org

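Limitation: this handles a case like bbwelcomewelcome, but it fails on overlapping repeats such as welwelcomewelcome, where it finds only welwel and misses welcomewelcome.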
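For comparison, an awk-only version that avoids backreferences and also covers the overlapping case could look like the sketch below. It is a brute-force approach, not something taken from the answers: it compares adjacent substrings of every possible length, which is fine for short strings like domain names but slow for long ones. The names longest_repeat_awk and longest_repeat are made up for this example.

```shell
# Hypothetical awk-only variant. Reads CSV lines (number,domain) from the
# given files, or from stdin, and prints the lines with the longest
# immediately repeated substring in the second field.
longest_repeat_awk() {
    awk -F, '
    # Length of the longest X such that XX occurs somewhere in s (0 if none).
    function longest_repeat(s,    n, len, pos) {
        n = length(s)
        for (len = int(n / 2); len >= 1; len--)
            for (pos = 1; pos + 2 * len - 1 <= n; pos++)
                if (substr(s, pos, len) == substr(s, pos + len, len))
                    return len
        return 0
    }
    {
        r = longest_repeat($2)
        if (r > best)                { best = r; out = $0 }   # new maximum
        else if (r > 0 && r == best) { out = out "\n" $0 }    # tie
    }
    END { if (best > 0) print out }
    ' "$@"
}

printf '%s\n' '3,hellohelloboys.ca' '7,welwelcomewelcome.com' | longest_repeat_awk
# prints: 7,welwelcomewelcome.com
```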

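Alternative with more awk and no sort

As tripleee pointed out, the two awk steps and the sort step can be replaced by a single awk step, which avoids sorting altogether: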

$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile

Here is that awk step in more detail, with longer variable names:

{
    # New longest match: throw away stored longest matches, reset index
    if (length() > max_len) {
        max_len = length()
        delete arr_longest
        idx = 1
    }

    # Add line to longest matches
    if (length() >= max_len)
        arr_longest[idx++] = $0
}

# Print all the longest matches
END {
    for (idx in arr_longest)
        print arr_longest[idx]
}

Some rough, non-scientific timings:

First solution (with sort and two awk steps):

964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com

real    1m55.742s
user    1m57.873s
sys     0m0.045s

Second solution (single awk step, no sort):

964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com

real    1m55.603s
user    1m56.514s
sys     0m0.045s

Perl solution:

964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com

real    0m5.249s
user    0m5.234s
sys     0m0.000s

The winner: Perl ;)

, , ( head -1 awk awk ), , , .

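Secondly, BSD grep does not accept grep -f - to read patterns from stdin, so on such systems the patterns have to be handed to grep -f in some other way.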
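Where grep -f - is not available, one possible workaround is to store the patterns in a temporary file first. The sketch below assumes that mktemp exists and that the local grep -E accepts backreferences, as the pipeline above already requires. The function name longest_repeat_lines_tmp and the -F option are additions made for this example, not part of the original answer.

```shell
# Hypothetical variant of the pipeline that avoids "grep -f -".
longest_repeat_lines_tmp() {
    infile=$1
    patterns=$(mktemp) || return 1

    # Same steps as before, but the longest repeats go to a file.
    cut -d ',' -f 2 "$infile" | grep -Eo '(.*)\1' |
        awk '{ if (length() > ml) { ml = length(); delete a; i = 1 }
               if (length() >= ml) a[i++] = $0 }
             END { for (i in a) print a[i] }' > "$patterns"

    # -F treats each stored repeat as a fixed string, so a "." inside a
    # repeat is not read as a regex wildcard.
    grep -F -f "$patterns" "$infile"
    status=$?
    rm -f "$patterns"
    return "$status"
}

# Usage, assuming the sample file from the question:
# longest_repeat_lines_tmp domains.csv
```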
