What determines the size of a stored object in R?

When I save an object from R using save(), what determines the size of the saved file? It is clearly not the same as (or even close to) the object size reported by object.size().

Example: I read a data frame and saved it using

snpmat=read.table("Heart.txt.gz",header=TRUE)
save(snpmat,file="datamat.RData")

The datamat.RData file size is 360 MB.

 > object.size(snpmat)
 4998850664 bytes #Much larger

Then I did some regression analysis and obtained another data frame, adj.snpmat, with the same dimensions (6820000 rows and 80 columns).

 > object.size(adj.snpmat)
 4971567760 bytes

I save it with

 > save(adj.snpmat,file="adj.datamat.RData") 

The adj.datamat.RData file size is now 3.3 GB. I am confused about why the two files differ so much in size while object.size() reports similar sizes. Any idea of what determines the size of the saved file is welcome.

Additional Information:

 > typeof(snpmat)
 [1] "list"
 > class(snpmat)
 [1] "data.frame"
 > typeof(snpmat[,1])
 [1] "integer"
 > typeof(snpmat[,2])
 [1] "double" #This is true for all columns except column 1
 > typeof(adj.snpmat)
 [1] "list"
 > class(adj.snpmat)
 [1] "data.frame"
 > typeof(adj.snpmat[,1])
 [1] "character"
 > typeof(adj.snpmat[,2])
 [1] "double" #This is true for all columns except column 1
1 answer

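Your two data frames hold very different content and therefore compress very differently. save() compresses its output by default, whereas object.size() reports the uncompressed in-memory size.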

SNP data contains only a few distinct values (for example, 0 or 1) and is also very sparse, which makes it very easy to compress. For example, a matrix of all zeros can be described by a single value (0) together with its dimensions.

Your regression matrix contains many different values, and they are real numbers (I assume p-values, coefficients, etc.). This makes it much less compressible.
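
The effect can be reproduced on a small scale. The following is a minimal sketch, not the asker's data: the vectors few_values and many_values are hypothetical stand-ins for an SNP-like column (only 0, 1, 2) and a regression-like column (arbitrary doubles). Both are stored as doubles, so object.size() should report the same size for each, while the files written by save() should differ considerably.

```r
# Hypothetical stand-ins for the two kinds of columns (not the original data).
set.seed(1)
n <- 1e6
few_values  <- as.numeric(sample(0:2, n, replace = TRUE))  # SNP-like: only 0, 1, 2
many_values <- rnorm(n)                                    # regression-like: arbitrary doubles

# In memory, both vectors are doubles of the same length.
mem_few  <- as.numeric(object.size(few_values))
mem_many <- as.numeric(object.size(many_values))

# On disk, save() compresses by default, so the sizes diverge.
f_few  <- tempfile(fileext = ".RData")
f_many <- tempfile(fileext = ".RData")
save(few_values,  file = f_few)
save(many_values, file = f_many)
disk_few  <- file.size(f_few)
disk_many <- file.size(f_many)

print(c(mem_few = mem_few, mem_many = mem_many))
print(c(disk_few = disk_few, disk_many = disk_many))
```

If the file size matters more than the time spent saving, save() also accepts compress = "bzip2" or compress = "xz", although data that is close to random will not shrink much under any of them.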


Source: https://habr.com/ru/post/971670/

