Removing rows and columns with all zeros

How can I delete rows and columns that contain all zeros from a text file? For example, I have this file:

 1 0 1 0 1
 0 0 0 0 0
 1 1 1 0 1
 0 1 1 0 1
 1 1 0 0 0
 0 0 0 0 0
 0 0 1 0 1

I want to delete the second and sixth rows, which are all zeros, as well as the fourth column, which is also all zeros. The result should look like this:

 1 0 1 1
 1 1 1 1
 0 1 1 1
 1 1 0 0
 0 0 1 1

I can do it with sed or egrep

  sed '/^0 0 0 0 0$/d' or egrep -v '^0 0 0 0 0$'

to drop the all-zero rows, but this is too inconvenient for files with thousands of columns. And I have no idea how to delete a column in which every value is zero (here, the fourth column).

+6
12 answers

Another awk option:

 awk '{show=0; for (i=1; i<=NF; i++) {if ($i!=0) show=1; col[i]+=$i;}} show==1{tr++; for (i=1; i<=NF; i++) vals[tr,i]=$i; tc=NF} END{for(i=1; i<=tr; i++) { for (j=1; j<=tc; j++) { if (col[j]>0) printf("%s%s", vals[i,j], OFS)} print ""; } }' file 

Extended form:

 awk '
 {
     show=0
     for (i=1; i<=NF; i++) {
         if ($i != 0) show=1
         col[i] += $i
     }
 }
 show==1 {
     tr++
     for (i=1; i<=NF; i++) vals[tr,i] = $i
     tc = NF
 }
 END {
     for (i=1; i<=tr; i++) {
         for (j=1; j<=tc; j++) {
             if (col[j] > 0) printf("%s%s", vals[i,j], OFS)
         }
         print ""
     }
 }' file
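For reference, running either form on the question's sample file should reproduce the expected result (each output line carries a trailing space, since printf always appends OFS). Assuming the program body is saved as remove_zeros.awk, the file name being illustrative:

 $ awk -f remove_zeros.awk file
 1 0 1 1
 1 1 1 1
 0 1 1 1
 1 1 0 0
 0 0 1 1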
+2

A Perl solution. It stores all nonzero rows in memory and prints them at the end, because it cannot tell which columns will be nonzero until it has processed the whole file. If you get an Out of memory error, you can store only the numbers of the lines you want to print, and process the file a second time when printing them.

 #!/usr/bin/perl
 use warnings;
 use strict;

 my @nonzero;   # Which columns were not zero.
 my @output;    # The whole table, kept for output.

 while (<>) {
     next unless /1/;
     my @col = split;
     $col[$_] and $nonzero[$_] ||= 1 for 0 .. $#col;
     push @output, \@col;
 }

 my @columns = grep $nonzero[$_], 0 .. $#nonzero;   # Which columns to output.
 for my $line (@output) {
     print "@{$line}[@columns]\n";
 }
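A minimal sketch of that low-memory variant, as my own illustration rather than part of the original answer: on the first pass, remember only the nonzero columns and the numbers of the lines worth printing, then reread the file to print.

 #!/usr/bin/perl
 use warnings;
 use strict;

 my (@nonzero, %keep_line);
 open my $fh, '<', $ARGV[0] or die $!;

 # First pass: note the nonzero columns and the line numbers to keep.
 while (<$fh>) {
     next unless /1/;
     $keep_line{$.} = 1;
     my @col = split;
     $col[$_] and $nonzero[$_] ||= 1 for 0 .. $#col;
 }

 # Second pass: print the kept lines, sliced to the nonzero columns.
 my @columns = grep $nonzero[$_], 0 .. $#nonzero;
 seek $fh, 0, 0;
 $. = 0;
 while (my $line = <$fh>) {
     next unless $keep_line{$.};
     my @col = split ' ', $line;
     print "@col[@columns]\n";
 }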
+4

Try the following:

 perl -n -e '$_ !~ /0 0 0 0/ and print' data.txt 

Or simply:

 perl -n -e '/1/ and print' data.txt 

Where data.txt contains your data.

On Windows, use double quotes:

 perl -n -e "/1/ and print" data.txt 
+3

Instead of storing lines in memory, this version reads the file twice: once to find the "zero columns", and again to find the "zero rows" and generate the output:

 awk '
     NR==1   {for (i=1; i<=NF; i++) if ($i == 0) zerocol[i]=1; next}
     NR==FNR {for (idx in zerocol) if ($idx) delete zerocol[idx]; next}
     {p=0; for (i=1; i<=NF; i++) if ($i) {p++; break}}
     p {for (i=1; i<=NF; i++) if (!(i in zerocol)) printf "%s%s", $i, OFS; print ""}
 ' file file
 1 0 1 1
 1 1 1 1
 0 1 1 1
 1 1 0 0
 0 0 1 1

A Ruby program: Ruby has a nice Array#transpose method.

 #!/usr/bin/ruby

 def remove_zeros(m)
   m.select {|row| row.detect {|elem| elem != 0}}
 end

 matrix = File.readlines(ARGV[0]).map {|line| line.split.map {|elem| elem.to_i}}

 # remove zero rows
 matrix = remove_zeros(matrix)
 # remove zero rows from the transposed matrix, then re-transpose the result
 matrix = remove_zeros(matrix.transpose).transpose

 matrix.each {|row| puts row.join(" ")}
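A possible invocation, assuming the script above is saved as remove_zeros.rb (the names are illustrative):

 $ ruby remove_zeros.rb file
 1 0 1 1
 1 1 1 1
 0 1 1 1
 1 1 0 0
 0 0 1 1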
+3

All together:

 $ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file |
   awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}}
        END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j]} printf "\n"}}'

This does a line check:

 $ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file
 1 0 1 1
 1 0 1 0
 1 0 0 1

It loops over all the fields of the line. If any of them is "true" (that is, not 0), it prints the line (print) and skips to the next one (next).

This does a column check:

 $ awk '{l=NR; c=NF; for (i=1; i<=c; i++) { a[l,i]=$i; if ($i) e[i]++ }}
        END{ for (i=1; i<=l; i++){ for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j]} printf "\n" } }'

Basically, it stores all the data in an array a, with l holding the number of rows and c the number of columns. The array e records whether a column contains any value other than 0. In the END block it loops through the saved data and prints a field only if its column index is set in e, that is, only if that column has some nonzero value.

Test

 $ cat a
 1 0 1 0 1
 0 0 0 0 0
 1 1 1 0 1
 0 1 1 0 1
 1 1 0 0 0
 0 0 0 0 0
 0 0 1 0 1
 $ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' a |
   awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}}
        END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j]} printf "\n"}}'
 1 0 1 1
 1 1 1 1
 0 1 1 1
 1 1 0 0
 0 0 1 1

Test against an earlier version of the input:

 $ cat file
 1 0 1 1
 0 0 0 0
 1 0 1 0
 0 0 0 0
 1 0 0 1
 $ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file |
   awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}}
        END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j]} printf "\n"}}'
 1 1 1
 1 1 0
 1 0 1
+2

Off the top of my head...

The problem is the columns: how do you know whether a column is all zeros before you have read the whole file?

I think you need an array of columns, where each entry is itself an array holding one column's values that you can push onto: an array of arrays.

The trick is to skip lines containing all zeros when you read them:

 #! /usr/bin/env perl
 use strict;
 use warnings;
 use autodie;
 use feature qw(say);
 use Data::Dumper;

 my @array_of_columns;
 for my $row ( <DATA> ) {
     chomp $row;
     next if $row =~ /^(0\s*)+$/;   # Skip zero rows
     my @columns = split /\s+/, $row;
     for my $index ( (0..$#columns) ) {
         push @{ $array_of_columns[$index] }, $columns[$index];
     }
 }

 # Remove the columns that contain nothing but zeros
 for my $column ( (0..$#array_of_columns) ) {
     my $index = $#array_of_columns - $column;
     my $values = join "", @{ $array_of_columns[$index] };
     if ( $values =~ /^0+$/ ) {
         splice ( @array_of_columns, $index, 1 );
     }
 }

 say Dumper \@array_of_columns;

 __DATA__
 1 0 1 0 1
 0 0 0 0 0
 1 1 1 0 1
 0 1 1 0 1
 1 1 0 0 0
 0 0 0 0 0
 0 0 1 0 1

Of course, you could use Array::Transpose, which will transpose your array and greatly simplify this work.
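To illustrate, a minimal sketch of that route; it assumes the CPAN module Array::Transpose is installed, whose transpose() takes an array reference and returns the transposed list of rows:

 #!/usr/bin/perl
 use warnings;
 use strict;
 use Array::Transpose;

 # Keep only rows that contain something other than zero.
 sub nonzero_rows {
     return grep { grep { $_ != 0 } @$_ } @_;
 }

 my @matrix;
 while (<>) {
     next if /^(0\s*)+$/;          # skip zero rows as we read
     push @matrix, [ split ];
 }

 # Transpose, drop the all-zero "rows" (i.e. the columns), transpose back.
 my @t = nonzero_rows(transpose(\@matrix));
 @matrix = transpose(\@t);

 print "@$_\n" for @matrix;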

+1

The following script also performs two passes. During the first pass, it stores the numbers of the lines that should be omitted from the output, and the indexes of the columns that should be included. In the second pass, it prints those rows and columns. I think this keeps memory consumption to a minimum, which can make a difference when you are dealing with large files.

 #!/usr/bin/env perl

 use strict;
 use warnings;

 filter_zeros(\*DATA);

 sub filter_zeros {
     my $fh = shift;
     my $pos = tell $fh;

     my %nonzero_cols;
     my %zero_rows;

     while (my $line = <$fh>) {
         last unless $line =~ /\S/;
         my @row = split ' ', $line;
         my @nonzero_idx = grep $row[$_], 0 .. $#row;
         unless (@nonzero_idx) {
             $zero_rows{$.} = undef;
             next;
         }
         $nonzero_cols{$_} = undef for @nonzero_idx;
     }

     {
         my @idx = sort {$a <=> $b} keys %nonzero_cols;
         seek $fh, $pos, 0;
         local $. = 0;
         while (my $line = <$fh>) {
             last unless $line =~ /\S/;
             next if exists $zero_rows{$.};
             print join(' ', (split ' ', $line)[@idx]), "\n";
         }
     }
 }

 __DATA__
 1 0 1 0 1
 0 0 0 0 0
 1 1 1 0 1
 0 1 1 0 1
 1 1 0 0 0
 0 0 0 0 0
 0 0 1 0 1

Output:

 1 0 1 1
 1 1 1 1
 0 1 1 1
 1 1 0 0
 0 0 1 1
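Since filter_zeros only needs a seekable filehandle, pointing it at a regular file instead of the DATA section is a two-line change (file.txt is an assumed name):

 open my $fh, '<', 'file.txt' or die $!;
 filter_zeros($fh);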
+1

A slightly unorthodox solution, but devilishly fast and with low memory consumption:

 perl -nE's/\s+//g;$m|=$v=pack("b*",$_); push@v ,$v if$v!~/\000/}{$m=unpack("b*",$m);@m=split//,$m;@m=grep{$m[$_]eq"1"}0..$#m;say"@{[(split//,unpack(q(b*),$_))[@m]]}" for@v ' 
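For readers who do not speak golfed Perl, here is one possible hand-expanded reading (my sketch, not the author's). One caveat: the original keeps a row only if its packed form contains no zero byte, which matches "row contains a 1" only for widths up to 8 columns; the sketch below tests for a set bit instead.

 #!/usr/bin/perl
 use warnings;
 use strict;
 use feature 'say';

 my $mask = '';
 my @rows;
 while (<>) {
     s/\s+//g;                        # "1 0 1 0 1" -> "10101"
     my $v = pack 'b*', $_;           # pack the 0/1 characters into a bit string
     $mask |= $v;                     # OR every row into a per-column mask
     push @rows, $v if $v =~ /[^\0]/; # keep rows with at least one set bit
 }

 # Indices of the columns whose mask bit is set.
 my @bits = split //, unpack 'b*', $mask;
 my @keep = grep { $bits[$_] } 0 .. $#bits;

 # Unpack each kept row back into 0/1 characters and print the kept columns.
 say "@{[ (split //, unpack('b*', $_))[@keep] ]}" for @rows;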
+1

This is my awk solution. It will work with a variable number of rows and columns.

 #!/usr/bin/gawk -f

 BEGIN { FS = " " }

 {
     for (c = 1; c <= NF; ++c) {
         v = $c
         map[c, NR] = v
         ctotal[c] += v
         rtotal[NR] += v
     }
     fields[NR] = NF
 }

 END {
     for (r = 1; r <= NR; ++r) {
         if (rtotal[r]) {
             append = 0
             f = fields[r]
             for (c = 1; c <= f; ++c) {
                 if (ctotal[c]) {
                     if (append) {
                         printf " " map[c, r]
                     } else {
                         printf map[c, r]
                         append = 1
                     }
                 }
             }
             print ""
         }
     }
 }
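A possible invocation against the question's data, assuming the script above is saved as filter.awk (the name is illustrative):

 $ gawk -f filter.awk file
 1 0 1 1
 1 1 1 1
 0 1 1 1
 1 1 0 0
 0 0 1 1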
+1

This is a tough and tricky question, so to solve it we need to be tricky too :) In my version I rely on a learning script: every time we read a new line, we check for new fields that can be excluded, and if a new change is detected we start over.

The checking and starting over should not happen too often: we make a few rounds until we converge on a constant (possibly empty) set of fields to drop, and then we drop the value at each of those positions from every nonzero line.

 #! /usr/bin/env perl
 use strict;
 use warnings;
 use Data::Dumper;

 open my $fh, '<', 'file.txt' or die $!;
 ## open a temp file for output
 open my $temp, '>', 'temp.txt' or die $!;

 ## how many fields you have in your data;
 ## extend this list if you have more fields
 my @fields_to_remove = (0,1,2,3,4);
 my $change = $#fields_to_remove;

 while (my $line = <$fh>){
     if ($line =~ /1/){
         my @new = split /\s+/, $line;
         my $i = 0;
         for (@new){
             unless ($_ == 0){
                 @fields_to_remove = grep(!/$i/, @fields_to_remove);
             }
             $i++;
         }
         foreach my $field (@fields_to_remove){
             $new[$field] = 'x';
         }
         my $new = join ' ', @new;
         $new =~ s/(\s+)?x//g;
         print $temp $new . "\n";
         ## if a new change is detected, start over;
         ## this repeats a limited number of times,
         ## as the script keeps learning and eventually stops
         if ($#fields_to_remove != $change){
             $change = $#fields_to_remove;
             seek $fh, 0, 0;
             close $temp;
             unlink 'temp.txt';
             open $temp, '>', 'temp.txt';
         }
     } else {
         ## nothing -- this removes the zero lines
     }
 }

 ### this just shows which fields were removed
 print Dumper \@fields_to_remove;

I tested it with nine 25 MB data files and it worked fine; it was not very fast, but it did not consume much memory either.

+1

My compact alternative using grep and cut, which also copes with large files. Its only drawback: the long runtime on files with many columns, due to the for loop.

 # Remove constant lines using grep
 grep -v "^[0 ]*$\|^[1 ]*$" $fIn > $fTmp

 # Remove constant columns using cut and wc
 nc=$(cat $fTmp | head -1 | wc -w)
 listcol=""
 for (( i=1 ; i<=$nc ; i++ ))
 do
     nitem=$(cut -d" " -f$i $fTmp | sort | uniq | wc -l)
     if [ $nitem -gt 1 ]; then listcol=$listcol","$i; fi
 done
 listcol2=$(echo $listcol | sed 's/^,//g')
 cut -d" " -f$listcol2 $fTmp | sed 's/ //g' > $fOut
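The $fIn, $fTmp and $fOut variables are assumed to be set beforehand; one plausible binding, my assumption rather than part of the answer:

 fIn=file          # input matrix
 fTmp=$(mktemp)    # intermediate result with constant rows removed
 fOut=result.txt   # final output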
0

You can check the rows like this:

 awk '/[^0[:blank:]]/' file

It simply checks whether the line contains any character other than 0 or <blank>, and prints the line if it does.

If you now want to check the columns as well, I propose an adaptation of Glenn Jackman's answer:

 awk '
     NR==1   {for (i=1; i<=NF; i++) if ($i == 0) zerocol[i]=1; next}
     NR==FNR {for (idx in zerocol) if ($idx) delete zerocol[idx]; next}
     /[^0[:blank:]]/ {for (i=1; i<=NF; i++) if (i in zerocol) $i=""; print}
 ' file file
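One caveat, as my own observation rather than part of this answer: assigning $i="" blanks a field but keeps its OFS, so every removed column leaves a doubled space in the output. If that matters, one way to squeeze the repeats is:

 awk '
     NR==1   {for (i=1; i<=NF; i++) if ($i == 0) zerocol[i]=1; next}
     NR==FNR {for (idx in zerocol) if ($idx) delete zerocol[idx]; next}
     /[^0[:blank:]]/ {for (i=1; i<=NF; i++) if (i in zerocol) $i=""; print}
 ' file file | tr -s ' '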
0
