Identifying Subarrays in Matrices in Perl

I am relatively new to Perl, and I need to do relatively complex mathematical calculations and don’t know which data structures to use.

Not sure if this is the right forum for this, but say you have the following matrix in a multidimensional array in Perl:

0.2 0.7 0.2 0.6 0.8 0.7 0.6 0.1 0.8 0.1 0.2 0.9 0.6 0.3 0.0 0.6 0.9 0.2 

I am trying to identify the column segments in this matrix corresponding to continuous values ​​exceeding a given threshold, for example. 0.5

For example, if we generate this matrix, we have:

 0 1 0 1 1 1 1 0 1 0 0 1 1 0 0 1 1 0 

If we now focus on the first column:

 0 1 1 0 1 1 

it can be seen that there are two continuous segments:

0 1 1 0 1 1

  • The first track (a sequence of them) starts at index 1 and ends at index 2
  • The second track (a sequence of them) starts at index 4 and ends at index 5

I would like to find all such tracks in the source matrix, but I don’t know how to continue or which Perl data structures are most suitable for this.

Ideally, I would like something easy to index, for example. assuming we use the tracks variable, I can store the indices for the first column (index 0) as follows:

 # First column, first track $tracks{0}{0}{'start'} = 1; $tracks{0}{0}{'end'} = 2; # First column, second track $tracks{0}{1}{'start'} = 4; $tracks{0}{1}{'end'} = 5; # ... 

What are some good data structures and / or libraries that I can use to solve this problem in Perl?

+4
source share
3 answers

It looks like what you want. I presented the data in the form you proposed, since the ideal form depends entirely on what you want to do with the result

It works by calculating a list of 0s and 1s from each column, adding zero barriers at each end (one in $prev and one in the for list), and then looking at the list for changes between 1 and 0

Each time a change occurs, the beginning or end of the track is recorded. If $start undefined, then the current index is written as the beginning of the segment, otherwise the current segment ends one less than the current index. The hash is built using the start and end keys and is placed in the @segments array.

The final set of nested loops unloads the calculated data in the form displayed in the question

 use strict; use warnings; use constant THRESHOLD => 0.5; my @data = ( [ qw/ 0.2 0.7 0.2 / ], [ qw/ 0.6 0.8 0.7 / ], [ qw/ 0.6 0.1 0.8 / ], [ qw/ 0.1 0.2 0.9 / ], [ qw/ 0.6 0.3 0.0 / ], [ qw/ 0.6 0.9 0.2 / ], ); my @tracks; for my $colno (0 .. $#{$data[0]}) { my @segments; my $start; my $prev = 0; my $i = 0; for my $val ( (map { $_->[$colno] > THRESHOLD ? 1 : 0 } @data), 0 ) { next if $val == $prev; if (defined $start) { push @segments, { start => $start, end=> $i-1 }; undef $start; } else { $start = $i; } } continue { $prev = $val; $i++; } push @tracks, \@segments; } # Dump the derived @tracks data # for my $colno (0 .. $#tracks) { my $col = $tracks[$colno]; for my $track (0 .. $#$col) { my $data = $col->[$track]; printf "\$tracks[%d][%d]{start} = %d\n", $colno, $track, $data->{start}; printf "\$tracks[%d][%d]{end} = %d\n", $colno, $track, $data->{end}; } print "\n"; } 

Output

 $tracks[0][0]{start} = 1 $tracks[0][0]{end} = 2 $tracks[0][1]{start} = 4 $tracks[0][1]{end} = 5 $tracks[1][0]{start} = 0 $tracks[1][0]{end} = 1 $tracks[1][1]{start} = 5 $tracks[1][1]{end} = 5 $tracks[2][0]{start} = 1 $tracks[2][0]{end} = 3 
+1
source

I just give an algorithmic answer, and you can encode it in any language that you like.

Divide the problem into subtasks:

  • Threshold: depending on how you store the input, it can be as simple as iterating over a $ n $ dimensional matrix or traversing a tree / list if your matrices are sparse. This is an easy bit.

  • The continuous segment search algorithm is called "path length coding". It takes a sequence with possible duplicates, such as 1 0 0 1 1 1 1 0 1 and returns another sequence that tells you which element is next and how many of them are. So, for example, the above sequence will be 1 1 0 2 1 4 0 1 1 1. The encoding is unique, so if you ever want to invert it, you're fine.

The first 1 is, because the original input starts at 1, and the first 0 is, because after 1 there is 0, and the fourth number is two, because there are two consecutive zeros. If you do not want to make your own, there are zillions rle-encoders. Its main purpose is compression, and for this purpose it works quite well if you have long lines of identical elements. Depending on your needs, you can run it horizontally, vertically and even diagonally.

You will find the exact algorithm in all classic books on data structures and algorithm. I would suggest Cormen-Leiseron-Rivest-Stein: first an Introduction to Algorithms, then a Whip.

Once you get the gist, you can safely "merge" the threshold value with RLE to avoid repeating twice over your inputs.

+2
source

Based on the poor support for multidimensional arrays in Perl, I soon discovered that I was building a small solution myself. The algorithm is quite similar to Borodins idea, but has a slightly different structure:

 sub tracks { my ($data) = @_; # this sub takes a callback as argument my @tracks; # holds all found ranges my @state; # is true if we are inside a range/track. Also holds the starting index of the current range. my $rowNo = 0; # current row number while (my @row = $data->()) { # fetch new data for my $i (0..$#row) { if (not $state[$i] and $row[$i]) { # a new track is found $state[$i] = $rowNo+1; # we have to pass $rowNo+1 to ensure a true value } elsif ($state[$i] and not $row[$i]) { push @{$tracks[$i]}, [$state[$i]-1, $rowNo-1]; # push a found track into the @tracks array. We have to adjust the values to revert the previous adjustment. $state[$i] = 0; # reset state to false } } } continue {$rowNo++} # flush remaining tracks for my $i (0..$#state) { push @{$tracks[$i]}, [$state[$i]-1, $rowNo-1] if $state[$i] } return @tracks; } 

@state doubles as a flag indicating whether we are inside the track and as an entry for the track start index. In an array of states and tracks, the index indicates the current column.

I used an external file as a data source, but it can be easily connected to anything, for example. existing array. The only contract is that it must return an arbitrary sequence of true and false values ​​and an empty list when there is no data available.

 my $limit = 0.5 my $data_source = sub { defined (my $line = <>) or return (); # return empty list when data is empty chomp $line; return map {$_ >= $limit ? $_ : 0} split /\s+/, $line; # split the line and map the data to true and false values }; 

With the data that you insert as input, I get the following printout as output (print code omitted):

 [ [1 2], [4 5] ] [ [0 1], [5 5] ] [ [1 3] ] 

With your structure it will be

 $tracks[0][0][0] = 1; $tracks[0][0][1] = 2; $tracks[0][1][0] = 4; ...; 

If this is modified for a hash, additional data may be added, such as the original value.

+1
source

Source: https://habr.com/ru/post/1433891/


All Articles