How to grep within a grep

I have a bunch of massive text files, about 100 MB each.

I want grep to find entries that contain "INDIANA JONES":

$ grep -ir 'INDIANA JONES' ./ 

Then I would like to find the entries that also contain the word PORTUGAL within 5000 characters of the term INDIANA JONES. How can I do that?

 # in pseudocode:
 grep -ir 'INDIANA JONES' ./ | grep 'PORTUGAL' within 5000 characters
4 answers

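Use grep's -o flag to print each match together with up to 5,000 characters on either side of it, then search that output for the second string. Because . does not match a newline, the window never extends past the line that contains the match. For instance: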

 grep -ioE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL" 
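
To see what the -o window does on a small input, here is a minimal sketch. The helper name near_match and its parameters are hypothetical, it assumes GNU grep (for -o), and it uses {0,N} so that matches closer than N characters to the start or end of a line are still reported.

```bash
# near_match WINDOW FIRST SECOND FILE
# Hypothetical helper: prints each FIRST match together with up to WINDOW
# characters of same-line context on either side, keeping only the
# windows that also contain SECOND. Assumes GNU grep (-o is not POSIX).
near_match() {
    window=$1 first=$2 second=$3 file=$4
    grep -ioE ".{0,$window}$first.{0,$window}" "$file" | grep -i "$second"
}
```

For example, near_match 5000 'INDIANA JONES' 'PORTUGAL' file.txt behaves like the pipeline above, and a small window such as 10 makes it easy to check the behaviour on a test file.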

If you need the original lines, add the -n flag to the first grep, so that each match is prefixed with its line number in file.txt, and pipe the result into:

 cut -f1 -d: > line_numbers.txt 

then you can use awk to print these lines:

 awk 'FNR==NR { a[$0]; next } FNR in a' line_numbers.txt file.txt 
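
The awk command relies on the two-input idiom: while the first input is being read, FNR equals NR, so each line number is stored as an array key; for the second input, only lines whose number is a stored key are printed. A small sketch of the same idea, where print_lines is a hypothetical name and the line numbers arrive on standard input instead of from a temporary file:

```bash
# print_lines FILE
# Hypothetical helper: reads line numbers (one per line) from standard
# input and prints the corresponding lines of FILE.
print_lines() {
    awk 'FNR == NR { wanted[$0]; next }   # first input: remember each number
         FNR in wanted                    # second input: print matching lines
        ' - "$1"
}
```

For instance, printf '2\n4\n' | print_lines file.txt prints the second and fourth lines of file.txt.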

To avoid the temporary file, use process substitution:

 awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -inoE ".{0,5000}INDIANA JONES.{0,5000}" file.txt | grep "PORTUGAL" | cut -f1 -d:) file.txt 

For multiple files, combine find with a bash loop:

 find . -type f -print0 | while IFS= read -r -d '' i; do awk 'FNR==NR { a[$0]; next } FNR in a' <(grep -inoE ".{0,5000}INDIANA JONES.{0,5000}" "$i" | grep "PORTUGAL" | cut -f1 -d:) "$i"; done 
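
When only the locations are needed, a variant that prints file:line-number pairs may be easier to read than the full lines. This is a sketch under assumptions: search_tree and its optional WINDOW parameter are hypothetical names, GNU grep is assumed for -o, and find -exec is used so that file names containing spaces are passed through intact.

```bash
# search_tree DIR [WINDOW]
# Hypothetical helper: for every regular file under DIR, prints
# "file:line" for each line where PORTUGAL occurs within WINDOW
# characters (default 5000) of INDIANA JONES on the same line.
search_tree() {
    dir=$1
    window=${2:-5000}
    find "$dir" -type f -exec sh -c '
        window=$1; shift
        for f do
            grep -inoE ".{0,$window}INDIANA JONES.{0,$window}" "$f" |
                grep -i "PORTUGAL" |
                cut -d: -f1 | sort -un |
                while read -r n; do printf "%s:%s\n" "$f" "$n"; done
        done
    ' sh "$window" {} +
}
```

Calling search_tree . scans the current directory with the default window.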

Consider installing ack-grep.

 sudo apt-get install ack-grep 

ack-grep (called simply ack outside older Debian and Ubuntu releases) is a grep-like search tool that uses Perl regular expressions.

There is no trivial solution to your question (that I can think of) short of writing a full script, but you can use ack-grep's -A and -B flags to set the number of trailing and leading context lines to print, respectively.

This counts lines rather than characters, but it is a step in that direction.

Although this may not be a complete solution, it may give you some idea of how to do this. Look into search filters such as ack, awk, sed, etc., and see if you can find one with a flag for this behavior.

Ack-grep manual:

http://manpages.ubuntu.com/manpages/hardy/man1/ack-grep.1p.html

EDIT:

I think the sad news is that what you might think you are looking for is something like:

 grep "\(INDIANA JONES\).\{1,5000\}PORTUGAL" filename 

The problem is that, even on a small file, a query like this will not finish in a reasonable time. I got it to work with a different number, so this is a size issue.

For such a large set of files, you need to do this in more than one step.

A solution:

The only solution I know of is to use the leading and trailing context output from ack-grep.

Step 1: how long are your lines?

If you know how many lines 5000 characters can span (and you can estimate or calculate this in several ways), you can grep the output of the first grep. Depending on what is in your files, you should be able to find a decent upper bound on how many lines make up 5,000 characters: if a line has 100 characters on average, 50+ lines should cover you, but if it has 10 characters, it will take 500+.

You need to determine the maximum number of lines that 5000 characters can span. You can guess, or choose a generously large range if you want, but it is up to you. It is your data.

Then run the following (assuming you need 100 lines to cover 5000 characters):
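
One way to get a starting estimate is to divide the character budget by the file's average line length. The sketch below does that with awk; context_lines is a hypothetical name, newlines are not counted, and because an average is not an upper bound the result should be padded generously before it is used with -A and -B.

```bash
# context_lines FILE [CHARS]
# Hypothetical helper: estimates how many lines are needed to span
# CHARS characters (default 5000) from the average line length of FILE.
# Newline characters are not counted.
context_lines() {
    awk -v chars="${2:-5000}" '
        { total += length($0) }
        END {
            if (NR == 0) { print 0; exit }        # empty file
            if (total == 0) { print NR; exit }    # only empty lines
            avg = total / NR
            n = int(chars / avg)
            if (n * avg < chars) n++               # round up
            print n
        }' "$1"
}
```

With lines of 100 characters this prints 50, and with lines of 10 characters it prints 500, matching the arithmetic above.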

 ack-grep -ira "PORTUGAL" -A 100 -B 100 filename 

and

 ack-grep -ira "INDIANA JONES" -A 100 -B 100 filename 

Replace 100 with whatever number you need.

Step 2: analyze the output

You will need to take the matches that ack-grep returns and parse them, searching again within these subranges.

Look for INDIANA JONES in the output of the first (PORTUGAL) ack-grep run, and look for PORTUGAL in the second set of matches.

It will take a little more work, probably involving a bash script (I can see if I can get this working this week), but it solves your massive data problem by breaking it into more manageable chunks.
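
A minimal sketch of the two steps, assuming plain grep is an acceptable stand-in for ack-grep here (grep accepts the same -A and -B options). The name near_lines is hypothetical. Only one direction is searched, which is enough when the same number of lines is used before and after the match.

```bash
# near_lines FILE [N]
# Hypothetical helper: prints the lines containing PORTUGAL that lie
# within N lines (default 100) of a line containing INDIANA JONES.
near_lines() {
    grep -i -B "${2:-100}" -A "${2:-100}" 'INDIANA JONES' "$1" |
        grep -i 'PORTUGAL'
}
```

This works on lines, not characters, so N should come from the estimate made in step 1.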


One way to handle this is with GNU awk (the RT variable used below is specific to gawk). You can set the record separator to either INDIANA JONES or PORTUGAL, and then check the length of each record (after deleting newlines, if newlines do not count toward the 5000-character limit). You may have to resort to a tool such as find to run this recursively over a directory.

 gawk -v RS='INDIANA JONES|PORTUGAL' '{a = $0; gsub("\n", "", a)}; ((RT ~ /IND/ && prevRT ~/POR/) || (RT ~ /POR/ && prevRT ~/IND/)) && length(a) < 5000{found=1}; {prevRT=RT}; END{if (found) print FILENAME}' file.txt 
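
Where gawk is not available, the same idea (measure the text between two consecutive, different terms) can be sketched in portable awk with match(). This is an illustration, not the command above: terms_within is a hypothetical name, newlines count as one character each, the search is case-sensitive, and the whole file is held in memory, which may be impractical for very large files.

```bash
# terms_within FILE [LIMIT]
# Hypothetical helper: exits with status 0 if INDIANA JONES and PORTUGAL
# occur next to each other with fewer than LIMIT characters (default
# 5000) between them, and with status 1 otherwise.
terms_within() {
    awk -v limit="${2:-5000}" '
        { text = text $0 "\n" }                  # keep newlines as characters
        END {
            prev = ""
            while (match(text, /INDIANA JONES|PORTUGAL/)) {
                term = substr(text, RSTART, 1)   # "I" or "P"
                gap = RSTART - 1                 # characters since the previous term
                if (prev != "" && term != prev && gap < limit) { found = 1; break }
                prev = term
                text = substr(text, RSTART + RLENGTH)
            }
            exit !found
        }' "$1"
}
```

A call such as terms_within file.txt && echo file.txt prints the file name only when the two terms are close enough.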

grep 'INDIANA JONES' . -iR -l | while read filename; do head -c 5000 "$filename" | grep -n PORTUGAL -H --label="$filename" ; done

This works as follows:

  • grep 'INDIANA JONES' . -iR -l - search all files in or below the current directory ( -R ), case-insensitively ( -i ), and print only the names of the files that match ( -l ), not any content.
  • | while read filename; do ...|...|...; done - for each line of input, store it in the variable $filename and run the pipeline.

Now, for each file that matches "INDIANA JONES", we do

  • head -c 5000 "$filename" - extract the first 5000 bytes of the file ( head -c counts bytes, which equal characters only in single-byte encodings)
  • grep ... - find PORTUGAL. Print the file name ( -H ); because grep is reading standard input here, we tell it which "file name" to display with --label="$filename" . Also print line numbers ( -n ).
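
The same pipeline can be wrapped in a function so that the directory and byte count are parameters. This is a sketch: portugal_in_head is a hypothetical name, GNU grep is assumed for --label, and IFS= read -r is used so that file names with leading spaces or backslashes are read unchanged. Like the one-liner, it only looks at the start of each file, not at the text around each INDIANA JONES match.

```bash
# portugal_in_head DIR [BYTES]
# Hypothetical helper: for each file under DIR that mentions
# INDIANA JONES, searches the first BYTES bytes (default 5000) for
# PORTUGAL and prints "file:line:text" for every hit.
portugal_in_head() {
    grep -iRl 'INDIANA JONES' "$1" |
    while IFS= read -r filename; do
        head -c "${2:-5000}" "$filename" |
            grep -n -H --label="$filename" 'PORTUGAL'
    done
}
```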

Source: https://habr.com/ru/post/957716/

