Linux: reading line by line and printing line by line

I'm new to shell scripting, and it would be great if I could get some help with the question below.

I want to read a text file line by line and print all the matches of a pattern found in each line on a single line of a new text file.

For instance:

    $ cat input.txt
    SYSTEM ERROR: EU-1C0A Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error
    SYSTEM ERROR: MG-7688 DEFAULT error -- SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error -- ERROR: MG-3218 error occured in HSSL
    SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error
    SYSTEM ERROR: EU-1C0A error Failed to fill in test report -- ERROR: MG-7688

The expected output is as follows:

    $ cat output.txt
    EU-1C0A TM-0401
    MG-7688 DN-0A00 DN-0A52 MG-3218
    DN-0A00 DN-0A52
    EU-1C0A MG-7688

I tried the following code:

    while read p; do
        grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
    done < input.txt > output.txt

which produced this result:

 EU-1C0A TM-0401 MG-7688 DN-0A00 DN-0A52 MG-3218 DN-0A00 DN-0A52 EU-1C0A MG-7688 ....... 

Then I also tried this:

    while read p; do
        grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs > output.txt
    done < input.txt

But it did not help :(

Maybe there is another way; I am open to awk / sed / cut or anything else ... :)

Note: there can be any number of error codes (i.e. XX-XXXX, the pattern of interest) on one line.

+6
8 answers

There is always perl! And it will capture any number of matches on each line.

    perl -nle '@matches = /[A-Z]{2}-[A-Z0-9]{4}/g; print(join(" ", @matches)) if (scalar @matches);' < input.txt > output.txt

-e supplies the Perl code to execute, -n runs it once for each input line, and -l automatically chomps the line and appends a newline when printing.

The regular expression implicitly matches against $_ , so writing @matches = $_ =~ //g would be needlessly verbose.

If there is no match, it does not print anything.

+4
    % awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt
    EU-1C0A TM-0401
    MG-7688 DN-0A00 DN-0A52 MG-3218
    DN-0A00 DN-0A52
    EU-1C0A MG-7688

Long form explanation:

    awk '
    BEGIN{ RS=": " }              # Set the record separator to colon-space
    NR>1 {                        # Ignore the first record
      printf("%s%s",              # Print two strings:
        $1,                       # 1. first field of the record (`$1`)
        ($0~/\n/) ? "\n" : " ")   # Ternary expression, read as `if condition (thing
                                  # between brackets), then thing after `?`, otherwise
                                  # thing after `:`.
                                  # So: If the record ($0) matches (`~`) newline (`\n`),
                                  # then put a newline. Otherwise, put a space.
    }
    ' input.txt

Previous answer to the unmodified question:

    % awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, (NR%2==1)?"\n":" "}' input.txt
    EU-1C0A TM-0401
    MG-7688 MG-3218
    DN-0A00 DN-0A52
    EU-1C0A MG-7688

Edit: with protection against ": " injection (thanks @e0k). It checks that the first field after the record separator looks the way we expect.

    awk 'BEGIN{RS=": "};NR>1 && $1 ~ /^[A-Z]{2}-[A-Z0-9]{4}$/ {printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt
+5

You can always do this very simply:

    $ awk '{o=""; for (i=1;i<=NF;i++) if ($i=="ERROR:") o=o$(i+1)" "; print o}' input.txt
    EU-1C0A TM-0401
    MG-7688 DN-0A00 DN-0A52 MG-3218
    DN-0A00 DN-0A52
    EU-1C0A MG-7688

The above adds a trailing space at the end of each line, which is trivial to avoid if you care ...
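
One way to avoid that trailing space, as a minimal sketch: keep a separator variable that is empty before the first code and a space afterwards. The helper name extract_codes and the variable names out and sep are assumptions made for illustration and are not part of the original answer.

```shell
#!/bin/sh
# extract_codes is a hypothetical helper name used only for this sketch.
extract_codes() {
    awk '{
        out = ""; sep = ""
        for (i = 1; i <= NF; i++)
            if ($i == "ERROR:") {
                out = out sep $(i + 1)   # sep is empty for the first code
                sep = " "                # later codes get a leading space
            }
        print out
    }' "$@"
}

# Sample line taken from input.txt in the question.
printf '%s\n' 'SYSTEM ERROR: EU-1C0A Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error' |
    extract_codes
```

Like the one-liner above, this still prints an empty line for input lines without any code; guarding the print with a test on out would suppress that.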

+2

To keep your grep pattern, do this:

    while IFS='' read -r p; do
        echo $(grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p")
    done < input.txt > output.txt
  • while IFS='' read -r p; do is the standard way to read line by line into a variable. See, for example, this answer .
  • grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p" runs your grep and prints the matches. <<<"$p" is a "here string" that provides the string $p (the line that was read) as stdin to grep . This means that grep searches only the contents of $p and prints each match on its own line.
  • echo $(grep ...) converts the newlines in grep's output into spaces and appends a newline at the end. Since the loop runs once per line, the result is that the matches from each line of input are printed on one line of output.
  • done < input.txt > output.txt is correct: you provide input and output for the loop as a whole. You do not need redirection inside the loop.
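
To make the difference between the two loops concrete, here is a small sketch that runs both on two made-up lines. The function names and the sample codes are assumptions for illustration only, and a pipe from printf stands in for the here-string so that the sketch also runs under a plain POSIX sh.

```shell
#!/bin/sh
# Two made-up input lines; the codes are illustrative only.
sample() {
    printf '%s\n' 'ERROR: AA-0001 x -- ERROR: BB-0002' 'ERROR: CC-0003 y'
}

# Loop from the question: grep has no input of its own, so it inherits the
# loop's stdin and reads every line that "read" has not consumed yet.
shared_stdin() {
    sample | while read -r p; do
        grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
    done
}

# Loop from the answer: grep only sees the current line. The unquoted $(...)
# is word-split, so echo joins the matches of that line with spaces.
per_line() {
    sample | while IFS='' read -r p; do
        echo $(printf '%s\n' "$p" | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}')
    done
}

echo "shared stdin:"; shared_stdin
echo "per line:";     per_line
```

In the first loop grep drains the rest of the input in a single call, which is why all of its matches end up on one output line; in the second loop each call is limited to the line held in p.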
+1

Another solution, which works if you know that each line will contain exactly two instances of the strings you want to match:

    cat input.txt | grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs -L2 > output.txt
+1

Here is an awk solution that is pretty straightforward, though it is not an elegant one-liner (as many awk solutions are). It should work with any number of error codes on each line, with an error code defined as a whitespace-separated field that matches the given regular expression. Since it is not a fancy one-liner, I saved the program in a file:

codes.awk

    #!/usr/bin/awk -f
    {
        m=0;
        for (i=1; i<=NF; ++i) {
            if ( $i ~ /^[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$/ ) {
                if (m>0) printf OFS
                printf $i
                m++
            }
        }
        if (m>0) printf ORS
    }

You would run it like

 $ awk -f codes.awk input.txt 

Hopefully you find it easy to read. It runs the block once for each line of input. It iterates over each field and checks whether it matches the regular expression, then prints the field if it does. The variable m keeps track of the number of matching fields in the current line. Its purpose is to print the output field separator OFS (a space by default) between matched fields only when needed, and to print the output record separator ORS (a newline by default) only if at least one error code has been found. This prevents unnecessary whitespace.

Please note that I changed your regular expression from [A-Z]{2}-[A-Z0-9]{4} to [A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9] . This is because old awk does not (or at least may not) support interval expressions (the {n} parts). You can, however, use [A-Z]{2}-[A-Z0-9]{4} with gawk . You can adjust the regular expression as needed. (In both awk and gawk, regular expressions are delimited by / .)

The regular expression /[A-Z]{2}-[A-Z0-9]{4}/ will match any field that contains the XX-XXXX pattern of letters and digits. You want the field to match the regular expression as a whole, not merely to contain something that matches the pattern. To do this, ^ and $ mark the beginning and end of the string. For example, /^[A-Z]{2}-[A-Z0-9]{4}$/ (with gawk) would match US-BOTZ , but not USA-ROBOTS . Without ^ and $ , USA-ROBOTS would match because it contains the substring SA-ROBO , which matches the regular expression.
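
The effect of the anchors can be checked with a short sketch. It uses grep -E instead of awk so that the interval expressions work without gawk; the function names are assumptions for illustration, and the two sample strings are the ones from the explanation above.

```shell
#!/bin/sh
# The two sample fields from the explanation, one per line.
fields() { printf '%s\n' 'US-BOTZ' 'USA-ROBOTS'; }

# Unanchored: any line that contains the pattern somewhere is printed.
unanchored() { fields | grep -E '[A-Z]{2}-[A-Z0-9]{4}'; }

# Anchored: the whole line has to be exactly XX-XXXX.
anchored() { fields | grep -E '^[A-Z]{2}-[A-Z0-9]{4}$'; }

echo "unanchored:"; unanchored
echo "anchored:";   anchored
```

The unanchored call prints both strings, while the anchored call prints only US-BOTZ .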

+1

Parsing grep -n with AWK

    grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | awk -F: -vi=0 '{
        printf("%s%s", i ? (i == $1 ? " " : "\n") : "", $2)
        i = $1
    }'

The idea is to join the lines of the grep -n output:

    1:EU-1C0A
    1:TM-0401
    2:MG-7688
    2:DN-0A00
    2:DN-0A52
    2:MG-3218
    3:DN-0A00
    3:DN-0A52
    4:EU-1C0A
    4:MG-7688

by line number. AWK initializes the field separator ( -F: ) and the variable i ( -vi=0 ), then processes the output of the grep command line by line.

It prints a character based on a conditional expression that checks the value of the first field, $1 . If i is zero (first iteration), it prints only the second field, $2 . Otherwise, if the first field equals i , it prints a space; if not, it prints a newline ( "\n" ). After the space or newline, the second field is printed.

After each fragment is printed, the value of the first field is stored in i for the following iterations (lines): i = $1 .
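
The same grouping logic can be sketched as a standalone filter. This is a variant rather than the command above: it compares each line number with the previous one and also ends the last output line with a newline, which the command above does not print. The helper name group_by_lineno and the variable prev are assumptions for illustration.

```shell
#!/bin/sh
# group_by_lineno is a hypothetical helper: it reads "N:match" lines, as
# produced by grep -n -o, and joins matches that share the same N.
group_by_lineno() {
    awk -F: '
        NR > 1 { printf "%s", ($1 == prev ? " " : "\n") }  # separator before each later match
        { printf "%s", $2; prev = $1 }                     # the match itself
        END { if (NR > 0) printf "\n" }                    # terminate the last output line
    '
}

# The first three lines of the grep -n output shown above.
printf '%s\n' '1:EU-1C0A' '1:TM-0401' '2:MG-7688' | group_by_lineno
```

Because the separator is chosen before each match is printed, no trailing space is produced.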

Perl

Parsing grep -n in Perl

    use strict;
    use warnings;

    my $p = 0;
    while (<>) {
        /^(\d+):(.*)$/;
        print $p == $1 ? " " : "\n" if $p;
        print $2;
        $p = $1;
    }

Usage: grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | perl script.pl

Single line

But Perl is actually so flexible and powerful that you can completely solve the problem with a single line:

    perl -lne 'print "@_" if @_ = /([A-Z]{2}-[A-Z\d]{4})/g' < file

I saw a similar solution in one of the answers here. However, I decided to publish it as it is more compact.

One of the key ideas is to use the -l switch, which

  • automatically chomps the input record separator $/ from each line;
  • sets the output record separator $\ to $/ (by default, this is a newline)

The output record separator, if defined, is printed after the last argument passed to print . As a result, the script prints all matches ( @_ in particular), followed by a newline.

The variable @_ is usually used as the array of subroutine parameters. I used it in the script just for the sake of brevity.

+1

In GNU awk. Supports multiple matches per record:

    $ awk '
    {
        while(match($0, /[A-Z]{2}-[A-Z0-9]{4}/)) {  # find first match on record
            b=b substr($0,RSTART,RLENGTH) OFS       # buffer the match
            $0=substr($0,RSTART+RLENGTH)            # truncate from start of record
        }
        if(b!="") print b                           # print buffer if not empty
        b=""                                        # empty buffer
    }' file
    EU-1C0A TM-0401
    MG-7688 DN-0A00 DN-0A52 MG-3218
    DN-0A00 DN-0A52
    EU-1C0A MG-7688

Downside: an extra OFS is appended to the end of each printed record.

If you want to use an awk other than GNU awk, replace the match regular expression with:

    while(match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/))
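
As a minimal sketch, the trailing OFS can be avoided by placing the separator before every match except the first. The version below uses the interval-free regular expression so that it does not depend on GNU awk; the helper name extract_with_match and the variables buf and sep are assumptions for illustration.

```shell
#!/bin/sh
# extract_with_match is a hypothetical helper name used only for this sketch.
extract_with_match() {
    awk '{
        buf = ""; sep = ""
        while (match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/)) {
            buf = buf sep substr($0, RSTART, RLENGTH)  # append the match
            sep = OFS                                  # separator for later matches only
            $0 = substr($0, RSTART + RLENGTH)          # continue after the match
        }
        if (buf != "") print buf                       # skip lines without a match
    }' "$@"
}

# First line of input.txt from the question, plus a line without any code.
printf '%s\n' \
    'SYSTEM ERROR: EU-1C0A Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error' \
    'nothing to see here' |
    extract_with_match
```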
0

Source: https://habr.com/ru/post/1013043/