Break a vector randomly into two sets

I have a vector t with a length of 100 and I want to divide it into 30 and 70 values, but the values should be selected randomly and without replacement. Thus, none of the 30 values can be included in the sub-vector of 70 values and vice versa.

I know the R sample function, which I can use to randomly select values from a vector with or without replacement. However, even when I use replace = FALSE, I have to run the sample function twice, once with 30 and once with 70 values to select. This means that some of the 30 values can also be among the 70 values and vice versa.

Any ideas?

4 answers

How about this:

    t <- 1:100  # or whatever your original set is
    a <- sample(t, 70)
    b <- setdiff(t, a)

As per my comment, what's wrong with:

    vec <- 1:100
    set.seed(2)
    samp <- sample(length(vec), 30)
    a <- vec[samp]
    b <- vec[-samp]

?

To show that this gives separate sets with no values in common:

    R> intersect(a, b)
    integer(0)

If you have duplicate values in your vector, that is another matter, but your question is unclear on this point.

With duplicates in vec, things get a little more complicated, and it depends on what result you want to achieve.

    R> set.seed(4)
    R> vec <- sample(100, 100, replace = TRUE)
    R> set.seed(6)
    R> samp <- sample(100, 30)
    R> a <- vec[samp]
    R> b <- vec[-samp]
    R> length(a)
    [1] 30
    R> length(b)
    [1] 70
    R> length(setdiff(vec, a))
    [1] 41

Thus, setdiff() "does not work" here, since it does not return a vector of the right length. On the other hand, a and b now contain duplicated values (but not duplicated observations from the sample!):

    R> intersect(a, b)
    [1] 57 35 91 27 71 63 8 92 49 77

The duplicates (the intersection) occur because the values above appear more than once in the original vec.
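
One way to confirm that an index-based split still partitions the observations, even when vec holds repeated values, is to compare positions rather than values. The following is only a minimal sketch; the names idx_a and idx_b are introduced here for illustration and do not appear in the answer above.

```r
set.seed(4)
vec <- sample(100, 100, replace = TRUE)   # vector with repeated values

set.seed(6)
idx_a <- sample(length(vec), 30)          # positions drawn for the first group
idx_b <- setdiff(seq_along(vec), idx_a)   # all remaining positions

a <- vec[idx_a]
b <- vec[idx_b]

# The positions are disjoint and together cover the whole vector,
# even though the values in a and b may overlap.
length(intersect(idx_a, idx_b))        # 0
length(a) + length(b) == length(vec)   # TRUE
identical(sort(c(a, b)), sort(vec))    # TRUE
```

Here setdiff() is applied to the positions, which are always unique, so it is safe regardless of what vec contains.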


How about this?

    x <- 1:100
    s70 <- sample(x, 70, replace = FALSE)
    s30 <- sample(setdiff(x, s70), 30, replace = FALSE)

s30 will contain the same numbers as setdiff(x, s70). The difference between them is that s30 is an unordered vector of length 30, whereas setdiff(x, s70) gives you an ordered (increasing) vector of length 30. You said you want random sub-samples of length 70 and 30, so s30 is better than just setdiff(x, s70). If ordering doesn't really matter, a better alternative would be to use setdiff without sample, as in @seancarmody's answer.
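
If both sub-samples should come out in random order, a possible variation (not taken from any of the answers here) is to shuffle the positions once and cut the permutation in two. This is a sketch under that assumption; perm is a name chosen for illustration.

```r
x <- 1:100

perm <- sample(length(x))   # random permutation of the positions 1..100
s30 <- x[perm[1:30]]        # values at the first 30 shuffled positions
s70 <- x[perm[31:100]]      # values at the remaining 70 shuffled positions

length(s30)   # 30
length(s70)   # 70
```

Because every position appears exactly once in perm, the two pieces cannot share an observation, and only a single call to sample is needed.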


As you mentioned split, you can also try something like this:

    set.seed(1)
    t <- sample(20:40, 100, replace=TRUE)
    groups <- rep("A", 100)
    groups[sample(100, 30)] <- "B"
    table(groups)
    # groups
    #  A  B
    # 70 30
    split(t, groups)
    # $A
    #  [1] 25 32 39 24 38 39 33 21 24 23 36 40 27 36 24 33 22 25 28 28 38 27 30 30 23
    # [26] 34 35 37 33 31 36 20 30 35 34 30 29 25 22 26 33 28 26 29 26 33 30 36 21 38
    # [51] 27 37 27 27 30 38 38 36 29 34 28 26 35 25 23 25 21 33 36 28
    #
    # $B
    #  [1] 27 33 34 28 30 35 39 20 32 37 36 22 28 36 31 38 21 30 39 25 28 40 24 34 22
    # [26] 38 36 29 37 32
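
The same labelling idea can also be written by building the labels with the desired counts first and then shuffling them. This is only a sketch of an alternative, assuming group sizes of 70 and 30; parts is a hypothetical name used for illustration.

```r
set.seed(1)
t <- sample(20:40, 100, replace = TRUE)

# Build exactly 70 "A" and 30 "B" labels, then shuffle their order.
groups <- sample(rep(c("A", "B"), times = c(70, 30)))

parts <- split(t, groups)   # list with one vector per label
length(parts$A)   # 70
length(parts$B)   # 30
```

Since each element of t receives exactly one label, the split works on observations rather than values, so repeated values in t do not cause any overlap between the groups.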

Source: https://habr.com/ru/post/1432229/

