Converting a large dataset to "transactions" from the arules package

The arules package in R uses the transactions class, so to use the apriori() function I need to convert my existing data. I have a matrix with 2 columns and roughly 1.6 million rows, and I tried to convert it as follows:

 transaction_data <- as(split(original_data[,"id"], original_data[,"type"]), "transactions") 

where original_data is my data matrix. Because of the amount of data, I used the largest Amazon AWS instance available, with 64 GB of RAM. After a while I get:

resulting vector exceeds vector length limit in 'AnswerType'

Memory usage on the machine is still "only" at 60%. Is this a limitation of R? Is there any way around this other than sampling? With only 1/4 of the data, the conversion worked fine.

Edit: As pointed out, one of the variables was a factor instead of a character. After the change, the conversion ran quickly and correctly.
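For reference, a minimal sketch of the fix described in the edit. The column names "id" and "type" come from the question; since a plain matrix cannot hold factors, this assumes original_data is actually a data frame:

    library(arules)

    # Inspect the column classes first; the problem column shows up as a factor
    str(original_data)

    # Drop the factor coding so both columns are plain character vectors
    original_data$id   <- as.character(original_data$id)
    original_data$type <- as.character(original_data$type)

    # Same conversion as above, now on character columns
    transaction_data <- as(split(original_data[, "id"],
                                 original_data[, "type"]),
                           "transactions")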

1 answer

I suspect your problem arises because one of the functions uses integers (rather than, say, floats) to index values. In any case, the data is not that large, so this is surprising. Maybe there is some other issue with the data, for example characters stored as factors?
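For context on the integer-indexing point: before long vectors arrived in R 3.0.0, a vector could hold at most 2^31 - 1 elements, because lengths were stored as 32-bit signed integers. You can check that limit directly:

    # Largest 32-bit signed integer, i.e. the classic vector length limit
    .Machine$integer.max
    # [1] 2147483647

1.6 million rows is far below that, which is why the error points to a data-type problem rather than sheer size.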

More generally, I would really recommend using memory-mapped files via bigmemory, which you can also split and process with bigsplit or mwhich. If offloading the data to disk works for you, you can also use a much smaller instance size and save $$. :)
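A rough sketch of that approach, assuming the data sit in a file named original_data.csv with numeric "id" and "type" columns (a big.matrix can only hold numeric values, so character ids would first have to be recoded as integers); bigsplit() comes from the companion bigtabulate package:

    library(bigmemory)
    library(bigtabulate)

    # File-backed big.matrix: the data are memory-mapped from disk, not held in RAM
    big_data <- read.big.matrix("original_data.csv", header = TRUE,
                                type = "integer",
                                backingfile = "original_data.bin",
                                descriptorfile = "original_data.desc")

    # Split the "id" values by "type" without materialising full in-memory copies
    id_by_type <- bigsplit(big_data, ccols = "type", splitcol = "id")

    # mwhich() filters rows cheaply, e.g. all rows with type equal to 1
    rows_of_type_1 <- mwhich(big_data, "type", 1, "eq")

The resulting list of id vectors can then be coerced to "transactions" the same way as in the question.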


Source: https://habr.com/ru/post/896223/

