I intended this to be just a comment, but it turned out too long. No worries if you'd rather not accept this as the answer.
Firstly, it's commendable that you're thinking about how to map this onto the GPU; more scientists should take the time to consider questions like this for their algorithms. However, after reading the description, I'm not convinced the GPU is the ideal way to parallelize this particular method. The reason is that GPGPU programming tends to pay off for "batch processing", that is, for algorithms that naturally decompose into elementary operations applied uniformly across many elements (the stream level).
It's not clear that this algorithm separates usefully along those lines, except at the summation/ranking level that others have already mentioned (and those kinds of sub-operations are already well understood on the GPU). But if you zoom out and try to use the GPU to speed up the column-versus-column comparisons themselves, there isn't much to exploit: knowing which records fall below a given cut point for one pair of columns does not let you avoid redoing the same calculations when either column changes.
In short, you will have to perform N(N+1)/2 separate top-level iterations of the algorithm, and there is no way around that. Within each of those iterations, though, there is plenty of room to ship the column arrays to the GPU and do the comparisons at the thread level to compute the various rank statistics quickly.
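To make that top-level structure concrete, here is a minimal NumPy sketch of the unavoidable pair loop. `hoeffding_d` is a hypothetical placeholder for whatever per-pair kernel (Fortran, GPU, or otherwise) you plug in:

```python
import numpy as np
from itertools import combinations_with_replacement

def hoeffding_d(x, y):
    # Placeholder: your real Fortran/GPU per-pair kernel goes here.
    return 0.0

def all_pairs(data):
    """data: (n_records, n_cols) array -> upper-triangle matrix of D values."""
    n_cols = data.shape[1]
    d = np.full((n_cols, n_cols), np.nan)
    # N*(N+1)/2 independent top-level iterations, one per column pair.
    for i, j in combinations_with_replacement(range(n_cols), 2):
        d[i, j] = hoeffding_d(data[:, i], data[:, j])
    return d
```

The point of the sketch is that the iterations are completely independent of one another, which is exactly what makes the next suggestion work.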
The better approach, I think, would be to use MPI on a multi-processor setup, so that each high-level iteration is handled by a different processor. A master/worker pattern (the traditional name, master/slave, is an unfortunate one) seems appropriate if you don't have enough processors, and/or if you want each processor to use its GPU within a single high-level iteration: as long as there are remaining entries in the upper triangle of the Hoeffding "covariance" matrix you're trying to compute, you keep farming tasks out to the available processors.
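For illustration, here is a hedged mpi4py sketch of that master/worker task farm. `compute_pair` and `N_COLS` are assumptions for the example; the real per-pair work (possibly on each worker's GPU) would happen inside `compute_pair`:

```python
from itertools import combinations_with_replacement
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
N_COLS = 100  # assumption: number of columns in your data set

def compute_pair(task):
    i, j = task
    return (i, j, 0.0)  # stand-in for the real Hoeffding computation

if rank == 0:
    # Master: one task per upper-triangle entry, handed out on demand.
    tasks = list(combinations_with_replacement(range(N_COLS), 2))
    results, n_active = [], size - 1
    status = MPI.Status()
    while n_active > 0:
        msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        if msg is not None:           # None is just a worker announcing itself
            results.append(msg)
        if tasks:
            comm.send(tasks.pop(), dest=status.Get_source())
        else:
            comm.send(None, dest=status.Get_source())  # shut the worker down
            n_active -= 1
else:
    # Worker: announce availability, then loop until told to stop.
    comm.send(None, dest=0)
    while True:
        task = comm.recv(source=0)
        if task is None:
            break
        comm.send(compute_pair(task), dest=0)
```

Run it with something like `mpiexec -n 8 python pairs.py`; idle processors automatically pick up the next pair, so load balancing comes for free.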
Secondly, I think Matlab and Jacket are more of a liability here than you might believe. Yes, Matlab is optimized for some linear-algebra operations, but for almost everything else it is unequivocally slower than any "real" programming language. The trade-off is that you get a lot of convenience features and commercial-grade documentation, which is sometimes useful.
The alternative I'd suggest is to use PyCUDA together with mpi4py in Python. With NumPy and Cython, Python is strictly better and faster than Matlab, even in terms of usability, and if you use the PyDev plugin for Eclipse, even the working environment is basically identical to Matlab's. On top of that, you don't have to pay for a license, nor for an add-on like Jacket.
You'll need to do a bit of work to get PyCUDA and mpi4py cooperating, but in the end I think the open-source-ness will make your effort much more valuable to more people, and the code will probably end up much faster.
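To give a flavor of the PyCUDA side, here is a minimal sketch of a brute-force O(n²) rank computation on the GPU, the kind of thread-level comparison work mentioned above. The kernel is an assumed illustration, not the actual Hoeffding kernel, and in practice you'd likely use a sort-based ranking instead:

```python
import numpy as np
import pycuda.autoinit                 # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# One thread per record: rank[i] = #{ j : x[j] < x[i] }.
mod = SourceModule("""
__global__ void ranks(const float *x, int *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int count = 0;
    for (int j = 0; j < n; ++j)
        count += (x[j] < x[i]);
    r[i] = count;
}
""")
ranks = mod.get_function("ranks")

x = np.random.rand(4096).astype(np.float32)   # one column of records
x_gpu = gpuarray.to_gpu(x)
r_gpu = gpuarray.empty(x.shape, dtype=np.int32)
ranks(x_gpu, r_gpu, np.int32(x.size),
      block=(256, 1, 1), grid=((x.size + 255) // 256, 1))
print(r_gpu.get()[:10])
```

Each MPI worker in the earlier sketch could run something like this on its assigned column pair.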
Also, if you really are attached to Matlab: have you benchmarked the existing Fortran code on a single pair of columns, with thousands of records but just the two columns? If that's fast enough, you could write a wrapper for that Fortran as a mex file, in which case a handful of plain Matlab sessions running in parallel may be all you need.