Why are these computational speeds different for a multidimensional array in C++?

This may be a duplicate question; if so, please feel free to mark it as one. In C++, I learned that array elements are stored sequentially in memory (How are 3D arrays stored in C?), so I did a little experiment: assigning natural numbers to a matrix of size 1600000000x1 and to one of size 1x1600000000 (please change matsize in the code to a lower value depending on your memory). The code below assigns the natural numbers from 1 to 1,600,000,000 to the matrix a (whose dimensions are 1x1600000000) and computes the sum of the cubes of all its elements. The opposite case is simply a matter of swapping the matrix dimensions, which I do by changing xdim to matsize and ydim to 1, then recompiling and running again. The matrix is indexed as [xdim][ydim].

 #include <iostream>
 #include <time.h>

 using namespace std;

 int main()
 {
     long int matsize, i, j, xdim, ydim;
     long double ss;
     double** a;
     double time1, time2, time3;
     clock_t starttime = clock();

     matsize = 1600000000;
     xdim = 1;
     ydim = matsize;
     ss = 0.0;

     a = new double*[xdim];
     for (i = 0; i < xdim; i++) {
         a[i] = new double[ydim];
     }
     time1 = (double)(clock() - starttime) / (double)CLOCKS_PER_SEC;
     cout << "allocated. time taken for allocation was " << time1 << " seconds. computation started" << endl;

     for (i = 0; i < xdim; i++) {
         for (j = 0; j < ydim; j++) {
             a[i][j] = (i + 1) * (j + 1);
             ss = ss + a[i][j] * a[i][j] * a[i][j];
         }
     }
     cout << "last number is " << a[xdim - 1][ydim - 1] << " . sum is " << ss << endl;
     time2 = ((double)(clock() - starttime) / (double)CLOCKS_PER_SEC) - time1;
     cout << "computation done. time taken for computation was " << time2 << " seconds" << endl;

     for (i = 0; i < xdim; i++) {
         delete [] a[i];
     }
     delete [] a;
     time3 = ((double)(clock() - starttime) / (double)CLOCKS_PER_SEC) - time2;
     cout << "deallocated. time taken for deallocation was " << time3 << " seconds" << endl;

     cout << "the total time taken is " << (double)(clock() - starttime) / (double)CLOCKS_PER_SEC << endl;
     cout << "or " << time1 + time2 + time3 << " seconds" << endl;
     return 0;
 }

My results for two cases:

Case 1: xdim = 1 and ydim = 1600000000

 allocated. time taken for allocation was 4.5e-05 seconds. computation started
 last number is 1.6e+09 . sum is 1.6384e+36
 computation done. time taken for computation was 14.7475 seconds
 deallocated. time taken for deallocation was 0.875754 seconds
 the total time taken is 15.6233
 or 15.6233 seconds

Case 2: xdim = 1600000000 and ydim = 1

 allocated. time taken for allocation was 56.1583 seconds. computation started
 last number is 1.6e+09 . sum is 1.6384e+36
 computation done. time taken for computation was 50.7347 seconds
 deallocated. time taken for deallocation was 270.038 seconds
 the total time taken is 320.773
 or 376.931 seconds

The sum in the output is the same in both cases. I can understand that the time taken to allocate and free the memory differs between the two cases, but why is the computation time so different if the memory allocation is contiguous? What is wrong with this code?

If it matters, I am using g++ on Mountain Lion and compiling with g++ -std=c++11, on a quad-core i7 with 16 GB of RAM.

2 answers

Each individual allocation stores its contents contiguously, including the array of pointers itself, but the addresses returned by successive calls to new are not contiguous with one another, and the allocator makes no attempt to place them next to each other. So if you have a huge array of pointers to tiny per-row allocations, your memory is not effectively contiguous and you will not get good cache hits. If you have a single pointer to one huge allocation, the memory is contiguous and the cache works well.

Visually, fast / continuous layout:

 *a--[0]
      |
     [0][1][2][3][4][5][...]

Your slow alternative:

      [0]       [0]
       \        /
 *a--[0][1][2][3][4][5][...]
      |   \   |   \
     [0]   \ [0]   [0]
            [0]

Multidimensional arrays can be created on the stack, for example,

 int x[10][20]; 

In this case the memory will be contiguous, with the rows x[0], x[1], etc. laid out back to back. (So x[0][19], the last element of the first row, immediately precedes x[1][0].)

To get an effectively contiguous multidimensional array on the heap, you should make a single allocation with new whose size is the product of the intended dimensions, and then write a wrapper class that multiplies out the indices to locate a specific element.


The computation time differs because of data caching. Exploiting locality of reference, the CPU loads data from neighboring addresses whenever you read a memory location: it predicts that the next read will be from an address just a few bytes ahead of the one you just read.

When the array is allocated as [1][N], the elements really are stored sequentially, so the CPU's predictions are correct almost all the time. The data you need is almost always available from the processor cache, which is several times faster than main memory. The CPU keeps loading the locations ahead of the one you just read while it performs the calculations, so fetching new data and adding the numbers proceed in parallel.

When you swap the dimensions around, the numbers you are adding are no longer in consecutive locations. This is because successive calls to new do not allocate data in consecutive regions of memory: memory-management libraries add a few bytes for bookkeeping purposes, and they always allocate chunks of memory with a minimum size that is often larger than a double. Requesting a chunk smaller than that minimum wastes the rest of the region. As a result, your doubles can end up some twenty bytes apart even in the best scenario* — enough to negate the effect of reading ahead from neighboring memory locations. The CPU is therefore forced to stall while data is loaded from a different location, which slows the computation down considerably.


* In the worst case, the values can land arbitrarily far apart, depending on the allocations and deallocations performed before your code runs.

Source: https://habr.com/ru/post/1484953/

