Forming a sparse matrix in R

I have a large file in the following format, which I read in as x:

userid,productid,freq
293994,8,3
293994,5,3
949859,2,1
949859,1,1
123234,1,1
123234,3,1
123234,4,1
...

Each line gives a user, a product, and how often that user bought the product. I am trying to turn this into a matrix with the product IDs as columns, the user IDs as rows, and the frequency as the cell value. The expected result is:

       1 2 3 4 5 8
293994 0 0 0 0 3 3
949859 1 1 0 0 0 0
123234 1 0 1 1 0 0

This is a sparse matrix. I tried table(x[[1]], x[[2]]), which works for small files, but beyond a certain size table gives an error:

Error in table(x[[1]], x[[2]]) : 
 attempt to make a table with >= 2^31 elements
Execution halted

Is there any way to make this work? I'm on R-3.1.0, which is supposed to support vectors of up to 2^51 elements, so I'm confused about why it can't handle a file of this size. I have 40MM lines with a total file size of 741M. Thanks in advance.


One way to do it with data.table:

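library(data.table)

# your_data_table is assumed to be a data.table, e.g. one read with fread();
# recent data.table versions provide dcast() themselves, so reshape2 is not needed
# adjust fun.aggregate as necessary - not very clear what you want from OP
dcast(your_data_table, userid ~ productid, value.var = "freq", fill = 0L)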
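To make the call above concrete, here is a minimal sketch on the seven sample rows from the question. The names `x` and `wide` are only illustrative, and `fread(text = ...)` assumes a reasonably recent data.table. Note that the result is an ordinary dense table, so on the full 40MM-line file it still has one cell per user/product pair.

```r
library(data.table)

# The sample rows from the question, read from inline text
x <- fread(text = "userid,productid,freq
293994,8,3
293994,5,3
949859,2,1
949859,1,1
123234,1,1
123234,3,1
123234,4,1")

# One row per user, one column per product, 0 where there was no purchase
wide <- dcast(x, userid ~ productid, value.var = "freq", fill = 0L)
print(wide)
```

If the same user/product pair can occur on more than one line, pass something like `fun.aggregate = sum` so the frequencies are added up rather than counted.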



A tidyr approach:

library(tidyverse)
library(magrittr)

# Replicate your example data
example_data <- matrix(
  c(293994,8,3,
    293994,5,3,
    949859,2,1,
    949859,1,1,
    123234,1,1,
    123234,3,1,
    123234,4,1),
  ncol = 3,
  byrow = TRUE) %>%
  as.data.frame %>%
  set_colnames(c('userid','productid','freq'))

# Convert data into wide format
spread(example_data, key = productid, value = freq, fill = 0)

spread reshapes the data from long to wide format, much as base R's table would. It is the tidyr/dplyr way of doing this; in data.table the equivalent is dcast.
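In current tidyr, `spread` is superseded by `pivot_wider`. A minimal sketch of the same reshape on the sample data follows; it assumes tidyr 1.1.0 or later (for the scalar `values_fill` and for `names_sort`), and the name `wide` is only illustrative.

```r
library(tidyr)

# The sample rows from the question
example_data <- data.frame(
  userid    = c(293994, 293994, 949859, 949859, 123234, 123234, 123234),
  productid = c(8, 5, 2, 1, 1, 3, 4),
  freq      = c(3, 3, 1, 1, 1, 1, 1)
)

# One row per user, one column per product, 0 where there was no purchase;
# names_sort = TRUE orders the product columns instead of using first appearance
wide <- pivot_wider(
  example_data,
  names_from  = productid,
  values_from = freq,
  values_fill = 0,
  names_sort  = TRUE
)
print(wide)
```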

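I tested the tidyr approach on a larger data set (2 mio records). If your data is too large to handle in one go, you could process it in chunks (and combine them with rbind) or move to a distributed setup (rhadoop or sparklyr).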


# Make some random IDs
randomkey <- function(digits){
  paste(sample(LETTERS, digits, replace = TRUE), collapse = '')
}

products <- replicate(10, randomkey(20)) %>% unique
customers <- replicate(500000, randomkey(50)) %>% unique

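# Four purchases per customer, each of a different product, so that every
# (useruid, productid) pair is unique, as spread() requires
big_example_data <- data.frame(
  useruid = rep(customers, each = 4),
  productid = as.vector(replicate(length(customers), sample(products, 4))),
  freq = sample(1:5, 4 * length(customers), replace = TRUE)
)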
# 2 mio rows of purchases
dim(big_example_data)
# With useruid, productid, freq
head(big_example_data)

# Test tidyr approach
system.time(
  big_matrix <- spread(big_example_data, key = productid, value = freq, fill = 0)
)
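The approaches above still build a dense result. The error in the question arises because table allocates one cell per user/product pair, and that count passes 2^31 even though most cells are zero. As a sketch that is not part of the answers above, the Matrix package (shipped with R as a recommended package) can store only the non-zero entries. The names `x`, `users`, `products` and `m` are illustrative.

```r
library(Matrix)

# The sample rows from the question
x <- data.frame(
  userid    = c(293994, 293994, 949859, 949859, 123234, 123234, 123234),
  productid = c(8, 5, 2, 1, 1, 3, 4),
  freq      = c(3, 3, 1, 1, 1, 1, 1)
)

# Map the IDs to consecutive row/column indices via factors
users    <- factor(x$userid)
products <- factor(x$productid)

# Only the (row, column, value) triplets are stored; all other cells are implicit zeros
m <- sparseMatrix(
  i = as.integer(users),
  j = as.integer(products),
  x = x$freq,
  dims = c(nlevels(users), nlevels(products)),
  dimnames = list(levels(users), levels(products))
)
print(m)
```

By default sparseMatrix adds up the values of repeated (i, j) pairs, so duplicate user/product lines end up summed. Base R's `xtabs(freq ~ userid + productid, data = x, sparse = TRUE)` is a closer drop-in for table and also returns a sparse matrix.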

Source: https://habr.com/ru/post/1545912/

