I am trying to run a random forest analysis in R on a wide genetic dataset (662 x 35350). All variables except the response are numeric, and 99% of them are binary 0/1. I am quite familiar with R's randomForest(), but until now have only worked with datasets of 5,000-10,000 variables. The next planned step in the analysis will involve an extremely large dataset with millions of variables, so I am motivated to find a solution to this problem.
I understand that R's randomForest has no inherent restriction on the number of variables, and I recall reading a published study that used on the order of 100,000 variables. But when I try to analyze the complete dataset (with ntree = 100), I get: "Error: protect(): protection stack overflow".
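For concreteness, here is a minimal sketch of the failing call (geno is a placeholder name for my dataset, with the response in column 1):

```r
library(randomForest)

# geno: 662 x 35350; response in column 1, predictors mostly binary 0/1
rf <- randomForest(x = geno[, -1], y = geno[, 1], ntree = 100)
# Error: protect(): protection stack overflow
```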
This happens whether the dataset is a data frame (as it was originally supplied) or converted to a matrix. When I submit the run to our cluster for parallel processing, I can see all of my cores spin up as soon as the code starts. I can also see that my RAM usage never comes anywhere near the machine's limit (48 GB); at best it reaches about 16% of RAM during an attempted run. (I had the same problem on the 512 GB RAM machine at the office, where usage never exceeded about 5%.)
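The parallel setup looks roughly like this (again with geno as a placeholder; the trees are split across workers and the forests merged with randomForest::combine):

```r
library(doParallel)    # also attaches foreach and parallel
library(randomForest)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Grow 100 trees as 4 x 25 across the workers, then merge the forests;
# geno must be visible to the workers (foreach exports it from the
# calling environment)
rf <- foreach(nt = rep(25, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar%
  randomForest(x = geno[, -1], y = geno[, 1], ntree = nt)

stopCluster(cl)
```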
I have tried several solutions found online, including one from a previous StackOverflow post (Increase (or decrease) the memory available to R processes). I followed the instructions BobbyShaftoe gave in 2009 (adding --max-mem-size=49000M and --max-vsize=49000M in the shortcut's Target field), but this prevented R from opening properly. I also tried passing these flags on the command line, but that produces: "'--max-ppsize' / '--max-vsize=5000M' is not recognized as an internal or external command, operable program or batch file."
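One diagnostic I have been using to check whether startup flags actually reach the R process (my own sketch, not something from the posts above):

```r
# Print the arguments R was started with; if a flag such as
# --max-ppsize=500000 took effect, it should appear in this vector
commandArgs(trailingOnly = FALSE)
```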
I have also read the suggestions in this post: How to improve randomForest performance?. I cannot reduce the number of features until I have at least one complete run on the full set. (Also, I'm not sure the problem is really RAM as such.)
I am on Windows 7 running Revolution R 7.2 (64-bit). My memory limit is set to 49807 MB, but I am not sure whether memory.limit has any bearing on the size of the protection stack.
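For reference, these are the (Windows-only) checks I used to confirm the limit and peak usage:

```r
memory.limit()           # reports 49807 (MB) on this machine
memory.size(max = TRUE)  # peak memory obtained by this R session, in MB
```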
Breaking the dataset into smaller blocks of variables (which does run; see the sketch below) does not solve the analytical problem. Are there any suggestions for R settings that might allow analysis of the full dataset?
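This is roughly what the block-by-block workaround looks like (same geno placeholder as above; the block size of 5000 is arbitrary). Each per-block forest fits fine, but per-block results are not a substitute for a single run over all variables:

```r
# Split the predictor columns into blocks of at most 5000 and
# fit one forest per block -- this runs, but isn't what I need
blocks <- split(2:ncol(geno), ceiling(seq_along(2:ncol(geno)) / 5000))
fits <- lapply(blocks, function(idx)
  randomForest(x = geno[, idx], y = geno[, 1], ntree = 100))
```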
sessionInfo():