I have a piece of code that runs over a large matrix and computes per-column statistics, broken down by bin, where the bin of each item is given in a vector (binvec below).
The code looks roughly like this:
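    for (item = 0; item < items; item++) {
        uint8 bin = binvec[item];            /* bin this item belongs to */
        for (col = 0; col < columns; col++) {
            /* size_t rather than int: the offset may overflow int at this data size */
            size_t idx = (size_t)item * items_stride + (size_t)col * cols_stride;
            uint8 val = matrix[idx];         /* value that selects the counter */
            float x = matrix2[idx];          /* amount added to that counter */
            count[bin][val][col] += x;
        }
    }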
Suppose the number of columns is known at compile time. The values in matrix do not have any specific structure or order: they are effectively random. The data size is quite large: several million elements and hundreds of columns.
Looking at the code, I assume that the best performance will be achieved if:
(1) matrix is stored in row-major order (all columns of one item are contiguous), for better cache locality.
(2) count is laid out as count[bin][col][val], so that the address calculation for count[bin][col] can be optimized, which should also simplify prefetching, etc.
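For concreteness, the two layout changes above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual code: COLUMNS = 4 is a placeholder for the real compile-time column count, NUM_VALS = 256 assumes val is an 8-bit value, count is assumed to be one flat float array, and accumulate is a hypothetical function name. Whether this layout is actually faster still has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants: COLUMNS stands in for the real compile-time column
   count; NUM_VALS = 256 because val is an 8-bit value. */
enum { COLUMNS = 4, NUM_VALS = 256 };

/* matrix and matrix2 are items x COLUMNS, row-major (cols_stride == 1).
   count is a flat array of num_bins * COLUMNS * NUM_VALS floats,
   addressed as count[bin][col][val]. */
static void accumulate(const uint8_t *binvec,
                       const uint8_t *matrix,
                       const float *matrix2,
                       size_t items, size_t num_bins,
                       float *count)
{
    for (size_t item = 0; item < items; item++) {
        const uint8_t bin = binvec[item];
        assert(bin < num_bins);

        /* Row-major: one item's columns are contiguous in memory. */
        const uint8_t *vals = matrix + item * COLUMNS;
        const float *xs = matrix2 + item * COLUMNS;

        /* Base address of count[bin] is computed once per item. */
        float *bin_base = count + (size_t)bin * COLUMNS * NUM_VALS;

        for (size_t col = 0; col < COLUMNS; col++) {
            /* count[bin][col][val] += x */
            bin_base[col * NUM_VALS + vals[col]] += xs[col];
        }
    }
}
```

With this layout the inner loop reads matrix and matrix2 sequentially, and each column writes into its own 256-entry block of count, but the write position inside that block still depends on the random val.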
, matrix count , .
(1) (2) 50% .