Randomly select columns based on a group of columns

Question


I have a simple problem that can be solved in a dirty way, but I'm looking for a clean way using data.table.

I have the following data.table with columns belonging to several groups of unequal size. Here is an example of my data:

dframe   <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters


           A           A          A           A           A          A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069  0.1165187
2 -1.5891905 -0.44468389 -0.1186977  0.02270782 -0.64950716 -0.6844163
          A         A          A          A         B         B          B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272  0.8458673
2 -1.644389 0.6360258  0.5612634  0.3559574 1.9658743  1.858222 -1.4502839
           B          B          B         B          B           B          B
1  0.3167216 -0.2919079  0.5146733 0.6628149  0.5481958 -0.01721261 -0.5986918
2 -0.8104386  1.2335948 -0.6837159 0.4735597 -0.4686109  0.02647807  0.6389771
           B          B           B          B          C           C
1 -1.2980799  0.3834073 -0.04559749  0.8715914  1.1619585 -1.26236232
2 -0.3551722 -0.6587208  0.44822253 -0.1943887 -0.4958392  0.09581703
           C          C          C         C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2  0.1680119 -0.5990310  0.9779425 1.0819789

What I want to do is take a random subset of columns of a given size, keeping the same number of columns per group. If the requested sample size is larger than the number of columns in a group, all the columns of that group should be taken.

I tried an updated version of the method mentioned in this question:

subgroup row fetch from dataframe with dplyr

but I cannot match the column names with the by argument.

Can someone help me?


r data.table

ifreak, Jun 14 '17 at 11:21


A solution without dplyr, using lapply:

dframe   <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters

# Number of columns to sample per group
nc <- 8


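res <- do.call(cbind,
       lapply(unique(colnames(dframe)),
              function(x) {
                # Positions of the columns that belong to group x
                cols <- which(colnames(dframe) == x)
                # Sample nc of them only if the group has more than nc columns
                if (length(cols) > nc) cols <- sample(cols, nc, replace = FALSE)
                # drop = FALSE keeps a data frame even for a single column
                dframe[, cols, drop = FALSE]
              }
))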

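For each group, if it has at most nc columns, all of them are kept; if it has more than nc, nc of them are sampled at random.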

The selected columns come back with numeric suffixes in their names; these can be stripped with gsub:

# Escape the dot and anchor at the end so only a trailing ".<digits>" suffix is removed
colnames(res) <- gsub('\\.[[:digit:]]+$', '', colnames(res))
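As a quick check of the approach above, the following sketch rebuilds the example data and counts how many columns were kept per group. The seed and the names `group_of` and `counts` are assumptions added for illustration; they are not part of the original answer.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

nc <- 8  # number of columns to sample per group

res <- do.call(cbind, lapply(unique(colnames(dframe)), function(x) {
  cols <- which(colnames(dframe) == x)
  if (length(cols) > nc) cols <- sample(cols, nc)
  dframe[, cols, drop = FALSE]
}))

# Recover the group label of every selected column by removing any
# trailing ".<digits>" suffix, then count the columns per group.
group_of <- sub("\\.[[:digit:]]+$", "", colnames(res))
counts <- table(group_of)
print(counts)
```

Groups A and B have more than 8 columns, so 8 are sampled from each; group C has only 6, so all of them are kept.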


Val, Jun 14 '17 at 11:45

docendo discimus · Accepted Answer · 2017-06-14T12:09:47+0000

Another approach, if I understand correctly:

idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))

dframe[, keep]

Explanation:

idx
# $A
# [1]  1  2  3  4  5  6  7  8  9 10
# 
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 
# $C
# [1] 25 26 27 28 29 30

pmin(7, lengths(idx))
#[1] 7 7 6

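split() groups the column positions by column name into the list idx, and Map() calls sample() on each element of idx with the matching size from pmin(), which caps the sample size at the number of columns in that group.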
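The same idea can be wrapped in a small function. This is only a sketch: `sample_cols_by_group` is a hypothetical helper name, and the extra single-column group D is an assumption added to show one pitfall. When a group has exactly one column, `sample(x, size)` treats the single number `x` as the range `1:x`, so the sketch samples positions inside each group with `sample.int()` instead.

```r
# Hypothetical helper (not part of the original answer): keep up to n
# randomly chosen columns from each group of identically named columns.
sample_cols_by_group <- function(df, n) {
  idx <- split(seq_along(df), names(df))
  keep <- unlist(Map(function(pos, size) {
    # Index into pos instead of calling sample(pos, size) directly,
    # because sample() expands a single number x to 1:x.
    pos[sample.int(length(pos), size)]
  }, idx, pmin(n, lengths(idx))), use.names = FALSE)
  # sort() keeps the selected columns in their original order
  df[, sort(keep), drop = FALSE]
}

# Example data with an assumed extra group D that has a single column
dframe <- as.data.frame(matrix(rnorm(62), ncol = 31))
colnames(dframe) <- rep(c("A", "B", "C", "D"), times = c(10, 14, 6, 1))

picked <- sample_cols_by_group(dframe, 7)
group_counts <- table(sub("\\.[[:digit:]]+$", "", names(picked)))
print(group_counts)
```

Sorting the kept positions is a design choice that preserves the original column order; drop the `sort()` call if the order does not matter.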
