Grep does not work well on large files, is there an alternative?

I have a diff that essentially amounts to either genuinely new lines or lines that have merely moved within the file, and whose line numbers have therefore changed. To determine what is a truly new addition, I run this small Perl snippet to separate the “resolved” lines from the “unresolved” lines:

perl -n -e' /^\-([^\-].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDOUT "$1\n"; next; }; /^\+([^\+].*?)\([^,\(]+,\d+,\d+\).*$/ && do { print STDERR "$1\n"; next; }; ' "$delta" 1>resolved 2>unresolved 

This is actually pretty fast, and it does the job, splitting the 6000+ line diff into two files of a bit over 3000 lines each, stripping out any references to line numbers and unified diff markup. Next comes the grep command, which seems to run at 100% CPU for almost 9 minutes (real):

 grep -v -f resolved unresolved 

This essentially removes all resolved lines from the unresolved file. After 9 minutes it exits with, coincidentally, 9 lines of output: the unique additions, i.e. the unresolved lines.

Firstly, grep has served me well in the past, so why is it exceptionally slow and CPU-hungry in this case?

Secondly, is there a more efficient way to remove from one file the lines that are contained in another?

+6
2 answers

Grep is probably parsing the file completely for every pattern it has been told to find. You can try fgrep, if it exists on your system, or grep -F if it doesn't, which forces grep to use the Aho-Corasick string matching algorithm ( http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm ), which attempts to match all the strings simultaneously and therefore requires only one pass through the file.
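As a sketch, using the same resolved/unresolved file names from the question, the fixed-string version of the original command would be:

 grep -F -v -f resolved unresolved 

If only whole-line matches should count as "resolved", adding -x restricts grep to whole-line comparisons rather than substring matches.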

+5

If the lines you are matching between the two files are supposed to be exact matches, you can use sort and uniq to do the job:

 cat resolved resolved unresolved | sort | uniq -u 

The only non-duplicated lines in the above pipeline will be the unresolved lines that are not in resolved. Note that it is important to specify resolved twice in the cat command: otherwise uniq will also pick out lines that are unique to that file. This assumes that resolved and unresolved did not have duplicate lines to begin with. But that is pretty easy to deal with: just sort and uniq them first:

 sort resolved | uniq > resolved.uniq 
 sort unresolved | uniq > unresolved.uniq 
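Putting the two steps together (a sketch, still assuming the file names above), the deduplicated files can then be fed through the same pipeline:

 cat resolved.uniq resolved.uniq unresolved.uniq | sort | uniq -u 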

Also, I have found fgrep to be significantly faster when matching fixed strings, so that may be a good alternative.

+8

Source: https://habr.com/ru/post/977786/

