Break a vector randomly into two sets

Question

I have a vector t with a length of 100 and I want to divide it into 30 and 70 values, but the values should be selected randomly and without replacement. Thus, none of the 30 values can be included in the sub-vector of 70 values and vice versa.

I know R's sample function, which I can use to randomly select values from a vector with or without replacement. However, even when I use replace = FALSE, I have to run sample twice, once to select 30 values and once to select 70. This means that some of the 30 values can also appear among the 70 values and vice versa.

Any ideas?

random r sample random-sample

user969113 Sep 04 '12 at 10:06

4 answers

seancarmody · Answer 1 · 2012-09-04T10:20:24+0000

How about this:

    t <- 1:100  # or whatever your original set is
    a <- sample(t, 70)
    b <- setdiff(t, a)

Gavin Simpson · Answer 2 · 2012-09-04T10:51:14+0000

As per my comment, what's wrong with:

    vec <- 1:100
    set.seed(2)
    samp <- sample(length(vec), 30)
    a <- vec[samp]
    b <- vec[-samp]

?

To show that this gives separate sets with no values in common:

    R> intersect(a, b)
    integer(0)

If you have duplicate values in your vector, that is a different matter, but your question is unclear on this point.

With duplicates in vec things are a little more complicated, and it depends on what result you want to achieve.

    R> set.seed(4)
    R> vec <- sample(100, 100, replace = TRUE)
    R> set.seed(6)
    R> samp <- sample(100, 30)
    R> a <- vec[samp]
    R> b <- vec[-samp]
    R> length(a)
    [1] 30
    R> length(b)
    [1] 70
    R> length(setdiff(vec, a))
    [1] 41

So setdiff() "does not work" here, since its result does not have the right length. On the other hand, a and b now contain duplicate values (but not duplicate observations from the sample!):

    R> intersect(a, b)
     [1] 57 35 91 27 71 63  8 92 49 77

The duplicates (the intersection) occur because the values above appear more than once in the original vec.
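To make the distinction between duplicate values and duplicate observations concrete, here is a minimal sketch that splits by position rather than by value. The seed and the names idx, in_a and in_b are illustrative assumptions, not part of the answer above.

```r
set.seed(42)                               # arbitrary seed, for reproducibility only
vec <- sample(100, 100, replace = TRUE)    # a vector that contains duplicate values

idx  <- sample(length(vec))                # random permutation of the positions 1..100
in_a <- idx[1:30]                          # the first 30 shuffled positions go to a
in_b <- idx[31:100]                        # the remaining 70 positions go to b

a <- vec[in_a]
b <- vec[in_b]

length(a)                                  # 30
length(b)                                  # 70
length(intersect(in_a, in_b))              # 0: no position is used in both sets
```

The values in a and b may still overlap when vec has repeats, but every element of vec lands in exactly one of the two sets.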

Jilber Urbina · Answer 3 · 2012-09-04T10:20:35+0000

How about this?

    x <- 1:100
    s70 <- sample(x, 70, replace = FALSE)
    s30 <- sample(setdiff(x, s70), 30, replace = FALSE)

s30 contains the same numbers as setdiff(x, s70). The difference is that s30 is an unordered vector of length 30, whereas setdiff(x, s70) gives you an ordered (increasing) vector of length 30. You said you want random sub-samples of length 70 and 30, so s30 is better than just setdiff(x, s70). If ordering doesn't really matter, a better alternative is to use setdiff without sample, as in @seancarmody's answer.
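The claim about s30 and setdiff(x, s70) can be checked directly. This is a small sketch under the same setup; the seed and the name rest are assumptions added for illustration. The increasing order of rest follows from x itself being sorted, since setdiff keeps elements in the order they appear in its first argument.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x    <- 1:100
s70  <- sample(x, 70)             # 70 values drawn without replacement
rest <- setdiff(x, s70)           # the 30 values that were not drawn
s30  <- sample(rest, 30)          # the same 30 values, in random order

setequal(s30, rest)               # TRUE: same values, possibly different order
is.unsorted(rest)                 # FALSE: rest is in increasing order
length(intersect(s30, s70))       # 0: the two sub-samples share no values
```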

A5C1D2H2I1M1N2O1R2T1 · Answer 4 · 2012-09-04T10:26:12+0000

Since you mentioned splitting, you can also try something like this:

    set.seed(1)
    t <- sample(20:40, 100, replace=TRUE)
    groups <- rep("A", 100)
    groups[sample(100, 30)] <- "B"
    table(groups)
    # groups
    #  A  B
    # 70 30
    split(t, groups)
    # $A
    #  [1] 25 32 39 24 38 39 33 21 24 23 36 40 27 36 24 33 22 25 28 28 38 27 30 30 23
    # [26] 34 35 37 33 31 36 20 30 35 34 30 29 25 22 26 33 28 26 29 26 33 30 36 21 38
    # [51] 27 37 27 27 30 38 38 36 29 34 28 26 35 25 23 25 21 33 36 28
    #
    # $B
    #  [1] 27 33 34 28 30 35 39 20 32 37 36 22 28 36 31 38 21 30 39 25 28 40 24 34 22
    # [26] 38 36 29 37 32
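A possible variant of the same idea is to build the label vector with the desired group sizes first and then shuffle it. This is only a sketch; the names labels and parts are assumptions and do not come from the answer.

```r
set.seed(1)                                            # arbitrary seed, for reproducibility only
t <- sample(20:40, 100, replace = TRUE)                # example data with duplicate values

# 70 "A" labels and 30 "B" labels, put into random order
labels <- sample(rep(c("A", "B"), times = c(70, 30)))

parts <- split(t, labels)                              # list with elements $A and $B
lengths(parts)                                         # A: 70, B: 30
```

Because each position of t receives exactly one label, the two groups are disjoint as observations even when t contains repeated values.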
