Sampling by factor in R

I have a dataset of 1000 rows with the following structure:

    device geslacht leeftijd type1 type2
 1     mob        0       53     C     3
 2     tab        1       64     G     7
 3      pc        1       50     G     7
 4     tab        0       75     C     3
 5     mob        1       54     G     7
 6      pc        1       58     H     8
 7      pc        1       57     A     1
 8      pc        0       68     E     5
 9      pc        0       66     G     7
 10    mob        0       45     C     3
 11    tab        1       77     E     5
 12    mob        1       16     A     1

I would like to draw a sample of 80 rows, consisting of 10 rows with type1 = A, 10 rows with type1 = B, and so on. Can anyone help me?

3 answers

Base R Solution:

 # Split the data by type1, draw 10 rows from each piece, then stack the pieces
 do.call(rbind, lapply(split(df, df$type1), function(i) {
   i[sample(1:nrow(i), size = 10, replace = TRUE), ]
 }))

EDIT:

Other solutions suggested by @BrodieG

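 # Caveat: sample() treats a single index n as 1:n, so these assume every group has at least 2 rows
 with(df, df[unlist(lapply(split(seq_along(type1), type1), sample, 10, TRUE)), ])
 with(df, df[c(sapply(split(seq_along(type1), type1), sample, 10, TRUE)), ])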

Here is how I would do this using data.table:

 library(data.table)
 indx <- setDT(df)[, .I[sample(.N, 10, replace = TRUE)], by = type1]$V1
 df[indx]
 #     device geslacht leeftijd type1 type2
 #  1:    mob        0       45     C     3
 #  2:    mob        0       53     C     3
 #  3:    tab        0       75     C     3
 #  4:    mob        0       53     C     3
 #  5:    tab        0       75     C     3
 #  6:    mob        0       45     C     3
 #  7:    tab        0       75     C     3
 #  8:    mob        0       53     C     3
 #  9:    mob        0       53     C     3
 # 10:    mob        0       53     C     3
 # 11:    mob        1       54     G     7
 #...

Or a simpler version would be

 setDT(df)[, .SD[sample(.N, 10, replace = TRUE)], by = type1] 

Basically, we sample (with replacement, since your example has fewer than 10 rows in each group) from the row indices within each type1 group, and then subset the data by those indices.
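If the full 1000-row dataset has at least 10 rows per type1 value, replacement may not be needed at all. The sketch below is one way to check that and to fall back to replacement only for small groups. It uses base R to stay dependency-free; the toy data frame and the name n_per_group are illustrative assumptions, not part of the question.

```r
# Hypothetical toy data with the same key column as the question (values are made up)
set.seed(42)
df <- data.frame(
  device = sample(c("mob", "tab", "pc"), 1000, replace = TRUE),
  type1  = sample(LETTERS[1:8], 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Rows available per group; sampling without replacement needs at least n_per_group
group_sizes <- table(df$type1)

n_per_group <- 10  # assumed sample size per group
idx <- unlist(lapply(split(seq_len(nrow(df)), df$type1), function(rows) {
  # Indexing via sample.int() avoids sample()'s 1:n behaviour for a single index;
  # replacement is used only when the group is too small
  rows[sample.int(length(rows), n_per_group, replace = length(rows) < n_per_group)]
}))

smp <- df[idx, ]
sample_sizes <- table(smp$type1)
```

Comparing group_sizes with sample_sizes shows whether any group had to be sampled with replacement.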


Similarly, with dplyr you can do

 library(dplyr)
 df %>%
   group_by(type1) %>%
   sample_n(10, replace = TRUE)

Another option in base R:

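 # For each type1 value, draw 10 row indices with replacement, then subset
 # Caveat: if a group has exactly one row, sample() treats that index n as 1:n
 df[as.vector(sapply(unique(df$type1), function(x) {
   sample(which(df$type1 == x), 10, replace = TRUE)
 })), ]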

Source: https://habr.com/ru/post/986725/

