I have data in a normalized, tidy “long” data structure that I want to upload to H2O and, if possible, analyze there (or else reach a definite conclusion that I need more hardware and software than is currently available). The data is large but not huge: perhaps 70 million rows of 3 columns in its efficient normalized form, and 300 thousand by 80 thousand once it has been cast into a sparse matrix (most of the cells are zero).
The analytical tools in H2O need the data in the latter, wide format. Part of the overall motivation is to find where the limits of various hardware setups lie when analyzing such data, but at the moment I'm struggling just to get the data into an H2O cluster (on a machine where R can hold all of it in RAM), so I can’t yet judge the size limits for analysis.
The test data is similar to the following, where the three columns are “documentID”, “wordID” and “count”:
1 61 2
1 76 1
1 89 1
1 211 1
1 296 1
1 335 1
1 404 1
Not that it matters, since this isn't even a real-life data set for me, just a test set: the test data is taken from https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz (caution, large download).
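To analyze it, I need a matrix with one row per documentID and one column per wordID, where the cells hold the counts (the number of times that word occurs in that document). In R (for example) this can be done with tidyr::spread or (since in this case the dense data frame produced by spread would be too large) with tidytext::cast_sparse, which works fine with data of this size, as long as I am happy for the data to stay in R.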
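For illustration, here is a minimal sketch of the long-to-wide cast using `Matrix::sparseMatrix` rather than `tidytext::cast_sparse` (a swapped-in technique chosen only to keep the example dependency-free). The first three rows match the sample above; the two rows for document 2 are made up for the example.

```r
library(Matrix)

# Toy long-format data: documentID, wordID, count.
# Rows for document 2 are invented purely for illustration.
long <- data.frame(
  documentID = c(1, 1, 1, 2, 2),
  wordID     = c(61, 76, 89, 61, 404),
  count      = c(2, 1, 1, 3, 1)
)

# One row per document, one column per word.
# Pairs that never occur are implicit zeros and take no memory.
wide <- sparseMatrix(
  i = long$documentID,
  j = long$wordID,
  x = long$count
)

dim(wide)      # rows = largest documentID, columns = largest wordID
wide[1, 61]    # count of word 61 in document 1
nnzero(wide)   # number of stored (non-zero) cells
```

This assumes the IDs are already positive integers that can serve directly as row and column indices, as they appear to be in the test file.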
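The latest version of H2O (available from h2o.ai, but not yet on CRAN) has an R function as.h2o that understands sparse matrices, and it works well with smaller but still non-trivial data (for example, in a test case of 3500 rows by 7000 columns it imports the sparse matrix in 3 seconds, while the dense version takes 22 seconds), but when it is given my 300 000 by 80 000 sparse matrix, it fails with this error message: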
Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
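The file name in the error (cholmod_dense.c) suggests that a dense copy of the matrix is being attempted somewhere along the way. That reading is an assumption, not something confirmed from the H2O source. Under that assumption, a back-of-the-envelope calculation shows why it cannot work at this size. The sparse estimate assumes roughly 70 million non-zero cells stored in compressed column form (8-byte values, 4-byte row indices, 4-byte column pointers).

```r
n_rows <- 300000
n_cols <- 80000
n_nonzero <- 70e6                  # approximate number of rows in the long data

# Dense storage: every cell is an 8-byte double
cells <- n_rows * n_cols           # 2.4e10 cells
too_many <- cells > .Machine$integer.max   # more cells than a 32-bit index can address
dense_gb <- cells * 8 / 1e9

# Sparse (compressed column) storage: value + row index per non-zero,
# plus one column pointer per column (and one extra)
sparse_gb <- (n_nonzero * (8 + 4) + (n_cols + 1) * 4) / 1e9

too_many
dense_gb
sparse_gb
```

So the dense form would need on the order of 192 GB, while the sparse form fits in under 1 GB.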
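As far as I can tell, there are two ways forward:
- upload the long, tidy, efficient form of the data into H2O and do the reshaping ("spread") operation in H2O.
- do the reshaping in R (or any other language), save the resulting sparse matrix to disk in a sparse format, and load it into H2O from there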
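As far as I can tell, H2O has no functionality for option #1, i.e. an equivalent of tidytext::cast_sparse or tidyr::spread in R. Its data-munging capabilities look very limited. But perhaps I have missed something? So my first (not very optimistic) question is: can H2O "cast" or "spread" data from long to wide format, and if so, how?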
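Option #2 comes down to the same thing as an earlier question, for which the accepted answer was to save the data in SVMlight format. However, it isn't clear to me how to do that efficiently, and it isn't clear that the SVMlight format makes sense for data that is not going to be modeled with a support vector machine (for example, the data might be for an unsupervised learning problem). It would be much more convenient to save my sparse matrix in MatrixMarket format, which is supported by the Matrix package in R but, as far as I can tell, not by H2O. The MatrixMarket format looks very similar to my original long data: it is basically a space-delimited file of rowno colno cellvalue lines (with a two-line header).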
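To make the two on-disk formats concrete, here is a small sketch that writes the same toy sparse matrix both ways. `writeMM` and `readMM` come from the Matrix package. `write_svmlight` is a hypothetical helper written for this example, not part of any package, and it is not tuned for 70 million non-zeros. Because SVMlight expects a target value at the start of each line, the sketch writes a dummy label of 0, which is one possible workaround for unsupervised data, not an established convention.

```r
library(Matrix)

# Small stand-in for the real document-term matrix (2 documents, 3 words)
m <- sparseMatrix(i = c(1, 1, 2), j = c(1, 3, 2), x = c(2, 1, 5), dims = c(2, 3))

# MatrixMarket: header lines, then one "row column value" line per non-zero
mm_file <- tempfile(fileext = ".mtx")
writeMM(m, mm_file)
mm_lines <- readLines(mm_file)

# SVMlight: one line per row, "<label> <col>:<value> <col>:<value> ..."
# Hypothetical helper; `label` is a dummy target for unsupervised data.
write_svmlight <- function(mat, file, label = 0) {
  trip <- summary(mat)                      # columns i, j, x for the non-zero cells
  trip <- trip[order(trip$i, trip$j), ]     # feature indices ascending within each row
  pairs <- paste0(trip$j, ":", trip$x)
  rows <- factor(trip$i, levels = seq_len(nrow(mat)))
  feats <- tapply(pairs, rows, paste, collapse = " ")
  feats[is.na(feats)] <- ""                 # rows with no non-zero cells
  writeLines(trimws(paste(label, feats)), file)
}

svm_file <- tempfile(fileext = ".svm")
write_svmlight(m, svm_file)
svm_lines <- readLines(svm_file)

# Assumption, not run here: with a running H2O cluster the file could then be
# loaded with something like h2o.importFile(svm_file, parse_type = "SVMLight")
```

The SVMlight lines carry the same (row, column, value) information as the MatrixMarket triplets, just grouped by row, so the format itself does not tie the data to support vector machines.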