Perl: programming efficiency in calculating correlation coefficients for a large dataset

EDIT: the link should work now, sorry for the trouble

I have a text file that looks like this:

Name, Test 1, Test 2, Test 3, Test 4, Test 5
Bob, 86, 83, 86, 80, 23
Alice, 38, 90, 100, 53, 32
Jill, 49, 53, 63, 43, 23

I am writing a program that, given this text file, will create a table of Pearson correlation coefficients, where the entry (x, y) is the correlation between person x and person y:

Name, Bob, Alice, Jill
Bob, 1, 0.567088412588577, 0.899798494392584
Alice, 0.567088412588577, 1, 0.812425393004088
Jill, 0.899798494392584, 0.812425393004088, 1

My program works, except that the dataset I load has 82 columns and, more importantly, 54,000 rows. When I run my program right now it is incredibly slow, and eventually it dies with an out-of-memory error. Is there a way to, first of all, eliminate any possibility of running out of memory and, ideally, make the program more efficient? The code is here: code.

Thanks for your help,
Jack.

Edit: In case someone else is trying to perform a large-scale calculation like this, convert your data to HDF5 format. Here is what I did to solve this problem.

+3
7 answers

54000^2 * 82 — think about what that means. Even just the 54000 x 54000 output table of doubles is 54000^2 * 8 bytes, roughly 23 GB, so it can never fit in memory all at once; that alone explains the out-of-memory error. Do you really need every pairwise correlation? If not, compute only the pairs you actually care about, and write results out as you go instead of holding the whole table.

+4

Have you looked on CPAN? There is a gsl_stats_correlation function in Math::GSL::Statistics, which is a Perl binding to the GNU Scientific Library.

The signature is gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n), where $data1 and $data2 are references to arrays of length $n and the strides select how far apart consecutive elements are (1 for an ordinary array). It computes Pearson's coefficient:

r = \frac{\mathrm{cov}(x, y)}{\hat\sigma_x \hat\sigma_y} = \frac{\frac{1}{n-1} \sum (x_i - \hat x)(y_i - \hat y)}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat x)^2} \sqrt{\frac{1}{n-1} \sum (y_i - \hat y)^2}}
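As a minimal sketch of one pairwise coefficient computed this way (the score arrays below are made up, and I am assuming Math::GSL's usual array-reference calling convention):

use Math::GSL::Statistics qw(gsl_stats_correlation);

# two equal-length score vectors (hypothetical sample data)
my @bob   = (86, 83, 86, 80, 23);
my @alice = (38, 90, 100, 53, 32);

# stride 1 means "use every element"; the last argument is the count
my $r = gsl_stats_correlation(\@bob, 1, \@alice, 1, scalar @bob);
print "r = $r\n";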

+4

Take a look at PDL:

PDL ("Perl Data Language") gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing.

It is designed for exactly this kind of number crunching.

+3

A few suggestions: first, don't keep more in memory than you have to; read the input and write the output incrementally.

Second: make sure you are running perl 5.10.0; older perls are noticeably less memory-efficient with large data structures (cf. perlmonks thread).

Also! Instead of accumulating the entire table and printing it at the end, write each row out as soon as it is computed, something like this:

open FILE, ">", "file.txt" or die $!;
# header row: "Name" plus one label per column of the table
print FILE "Name, ", join(", ", 1 .. scalar @{ $correlations[0] }), "\n";
my $rowno = 1;
foreach my $row (@correlations) {
  # write each row out as soon as we reach it, one line at a time
  print FILE "$rowno, " . join(", ", @$row) . "\n";
  $rowno++;
}
close FILE;

Finally, if you really need speed, Perl may not be the best tool for this kind of number crunching; the same streaming approach in, say, C++ with iostreams would run considerably faster. Still, writing the output as you go should at least bring the memory usage under control.

+2

Since the calculation can be done with running sums, take a look at Statistics::LSNoHistory: it accumulates the statistics one point at a time, never keeping the raw data, and provides a pearson_r method, so you can stream the 82 paired scores of any two people through it and read off the coefficient.
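A minimal sketch of that approach, assuming the module's documented new/append_point/pearson_r interface (the score arrays are invented):

use Statistics::LSNoHistory;

# hypothetical scores for two people
my @x = (86, 83, 86, 80, 23);
my @y = (38, 90, 100, 53, 32);

# feed the pairs in one at a time; the module keeps only running sums
my $reg = Statistics::LSNoHistory->new;
$reg->append_point($x[$_], $y[$_]) for 0 .. $#x;

print "r = ", $reg->pearson_r, "\n";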

+1

In addition to the PDL comment above, here is code that efficiently calculates the correlation table even for very large datasets:

use PDL;
use PDL::Stats;                 # this useful module can be downloaded from CPAN
my $data = random(82, 5400);    # your data should replace this
my $table = $data->corr_table;  # that's all, really

You may need to set $PDL::BIGPDL = 1; at the top of your script, and make sure you run it on a machine with plenty of memory. The calculation itself is quite fast: the 82 x 5400 dataset above took only a few seconds on my laptop.
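If it helps, here is a minimal sketch of loading the question's file into a piddle in place of the random() placeholder (the filename and the comma parsing are assumptions based on the sample data shown above):

use PDL;
use PDL::Stats;

open my $fh, '<', 'scores.txt' or die $!;  # hypothetical filename
<$fh>;  # skip the header line
my (@names, @rows);
while (<$fh>) {
    chomp;
    my ($name, @scores) = split /,\s*/;  # name first, then the numeric scores
    push @names, $name;
    push @rows,  \@scores;
}
close $fh;

# pdl() builds a 2-D piddle with dimensions (tests, people),
# which is the layout corr_table expects
my $data  = pdl(\@rows);
my $table = $data->corr_table;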

0

Source: https://habr.com/ru/post/1705203/

