Data frame (matrix) performance: memory layout

Question

I'm new to R. I assume the memory layout is the same for a data frame and a matrix.

Consider the following matrix:

a = matrix(1:10000000, 1000000, 10)

It has 1M rows and 10 columns. Is it the rows or the columns that are contiguous in physical memory? That is, is memory laid out as [1,1], [2,1], [3,1], ..., [1M,1], [1,2], [2,2], ... or as [1,1], [1,2], ..., [1,10], [2,1], ...?

Assume that the matrix with 10M elements is 100M in size and the L2 cache is 4M, so the L2 cache cannot hold all 10M elements. If we process the data in the order it is stored, we get a lower L2 cache miss rate. In our case, we need to process the data row by row, reading several columns at once (for example columns A, B, C) and then producing some result. If the memory layout stored the 10 items of the 1st row first, then the 10 items of the 2nd row, and so on, performance might be better.

Is there a way to control the memory layout?

+3

r

Daniel Wu Jan 19 '11 at 9:15

2 answers

R fills a matrix column by column:

> m=matrix(1:12,nrow=3)
> m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
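
As a small sketch of how to check this yourself (the variable name `mr` is just for illustration): dropping the dimensions with `as.vector` shows the order in which the elements sit in the underlying vector. Note that `byrow = TRUE` only changes how the input values are filled in; as far as I know, it does not change how the result is stored.

```r
m <- matrix(1:12, nrow = 3)

# Dropping the dim attribute exposes the underlying storage order:
# all of column 1 first, then column 2, and so on.
as.vector(m)
# 1 2 3 4 5 6 7 8 9 10 11 12

# byrow = TRUE fills the values row by row, but the resulting matrix
# is still stored column by column.
mr <- matrix(1:12, nrow = 3, byrow = TRUE)
as.vector(mr)
# 1 5 9 2 6 10 3 7 11 4 8 12
```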

+6

Spacedman Jan 19 '11 at 10:25

Joshua Ulrich · Accepted Answer · 2011-01-19T15:30:04+0000

A matrix is just a vector with a dim attribute. The matrix elements are stored in that vector in column-major order. This cannot be changed.
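
To make that concrete, here is a minimal sketch (the names `v`, `i` and `j` are only for illustration). It attaches a dim attribute to a plain vector and checks that element [i, j] sits at linear position i + (j - 1) * nrow, which is what column-major order implies.

```r
v <- 1:12
dim(v) <- c(3, 4)   # adding a dim attribute is all it takes

is.matrix(v)                           # TRUE
identical(v, matrix(1:12, nrow = 3))   # TRUE

# Column-major order: element [i, j] is at linear position
# i + (j - 1) * nrow(v) in the underlying vector.
i <- 2
j <- 3
v[i, j]                      # 8
v[i + (j - 1) * nrow(v)]     # 8, the same element via its linear index
```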

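That said, operations that walk down columns tend to be faster than ones that walk across rows, as these timings suggest: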

> set.seed(21)
> a = matrix(rnorm(1e6),1e3,1e3)
> ta = t(a)
> system.time(for(i in 1:1000) colSums(ta))
   user  system elapsed 
   1.39    0.00    1.40 
> system.time(for(i in 1:1000) rowSums(a))
   user  system elapsed 
   2.40    0.00    2.39 
> identical(rowSums(a), colSums(ta))
[1] TRUE

Note that colSums, rowSums, colMeans and rowMeans are all handled by do_colsum in src/main/array.c.
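
For the row-by-row use case in the question, one possible workaround along the lines of the `ta = t(a)` trick above is to pay for the transpose once, so that each original row becomes a contiguous column. This is only a sketch on a small stand-in matrix: `row_result` is a made-up per-row computation, not anything from the question, and whether the extra copy pays off will depend on the real workload.

```r
set.seed(21)
a  <- matrix(rnorm(50), nrow = 10, ncol = 5)   # small stand-in for the 1M x 10 matrix
ta <- t(a)   # one-time copy; column i of ta holds row i of a, contiguously

# Hypothetical per-row computation that reads "columns" 1, 2 and 3 of a.
row_result <- function(x) x[1] + x[2] * x[3]

# Walk the original rows by walking the columns of the transpose.
res <- vapply(seq_len(ncol(ta)), function(i) row_result(ta[, i]), numeric(1))

# Same answer as the vectorized, column-at-a-time formulation.
all.equal(res, a[, 1] + a[, 2] * a[, 3])
```

If the computation can be written with whole columns, as in the last line, that form already reads memory in storage order and needs no transpose.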
