R: Condensed Indices

I have a vector like the following:

xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1) 

I want to find the indices of the runs of ones and combine them into start/end pairs. In this case, I want the result to look like 1 6 and 11 14 in a 2x2 matrix. My vector is actually very long, so I cannot do it manually. Can anyone help me with this? Thanks.

4 answers

Since the question originally had the bioinformatics tag, I'll mention the Bioconductor package IRanges (and its companion for ranges on genomes, GenomicRanges)

    > library(IRanges)
    > xx <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1)
    > sl = slice(Rle(xx), 1)
    > sl
    Views on a 14-length Rle subject
    views:
        start end width
    [1]     1   6     6 [1 1 1 1 1 1]
    [2]    11  14     4 [1 1 1 1]

which could be coerced to a matrix, although that is often not convenient for whatever the next step is

    > matrix(c(start(sl), end(sl)), ncol=2)
         [,1] [,2]
    [1,]    1    6
    [2,]   11   14

Other operations might start from the Rle, for example,

    > xx = c(2,2,2,3,3,3,0,0,0,0,4,4,1,1)
    > r = Rle(xx)
    > m = cbind(start(r), end(r))[runValue(r) != 0,,drop=FALSE]
    > m
         [,1] [,2]
    [1,]    1    3
    [2,]    4    6
    [3,]   11   12
    [4,]   13   14

See the help page ?Rle for the full flexibility of the Rle class. To go from a matrix like the one above to a new Rle, as asked in the comment below, one can create a new Rle of the appropriate length and then subset-assign using an IRanges as the index

    > r = Rle(0L, max(m))
    > r[IRanges(m[,1], m[,2])] = 1L
    > r
    integer-Rle of length 14 with 3 runs
      Lengths: 6 4 4
      Values : 1 0 1

One could expand this to a full vector

    > as(r, "integer")
     [1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1

but it is often better to continue the analysis on the Rle. The class is very flexible, so one way to go from xx to an integer vector of 1s and 0s is

    > as(Rle(xx) > 0, "integer")
     [1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1

Again, it often makes sense to stay in the Rle space. And Arun's answer to your separate question is probably best.
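For readers without Bioconductor installed, the round trip described above (from a start/end matrix back to a 0/1 vector) can be sketched in base R. This is only an illustration, not the Rle-based method of this answer; the names m, n, idx and x are chosen here for the example, and the total length is assumed to equal the largest end position.

```r
# Illustrative start/end matrix: runs at positions 1-6 and 11-14
m <- matrix(c(1, 11, 6, 14), ncol = 2)

# Assumption: the vector ends at the last run, so any trailing
# zeros of an original vector cannot be recovered from m alone
n <- max(m)

# Expand each row into its sequence of positions
idx <- unlist(Map(seq, m[, 1], m[, 2]))

# Start from all zeros and mark the covered positions with 1
x <- integer(n)
x[idx] <- 1L
x
```

If the true length of the original vector is known, it could be used in place of max(m) so that trailing zeros are preserved.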

Performance (speed) is important, although in this case I think the Rle class provides a lot of flexibility that would outweigh poor performance, and ending up at a matrix is an unlikely end point for a typical analysis. Nonetheless, the IRanges infrastructure is performant

    eddi <- function(xx)
        matrix(which(diff(c(0,xx,0)) != 0) - c(0,1), ncol = 2, byrow = TRUE)

    iranges = function(xx) {
        sl = slice(Rle(xx), 1)
        matrix(c(start(sl), end(sl)), ncol=2)
    }

    iranges.1 = function(xx) {
        r = Rle(xx)
        cbind(start(r), end(r))[runValue(r) != 0, , drop=FALSE]
    }

with

    > xx = sample(c(0, 1), 1e5, TRUE)
    > microbenchmark(eddi(xx), iranges(xx), iranges.1(xx), times=10)
    Unit: milliseconds
              expr       min        lq    median        uq      max neval
          eddi(xx)  45.88009  46.69360  47.67374 226.15084 234.8138    10
       iranges(xx) 112.09530 114.36889 229.90911 292.84153 294.7348    10
     iranges.1(xx)  31.64954  31.72658  33.26242  35.52092 226.7817    10

Something like this, maybe?

    if (xx[1] == 1) {
        rr <- cumsum(c(0, rle(xx)$lengths))
    } else {
        rr <- cumsum(rle(xx)$lengths)
    }
    if (length(rr) %% 2 == 1) {
        rr <- head(rr, -1)
    }
    oo <- matrix(rr, ncol=2, byrow=TRUE)
    oo[, 1] <- oo[, 1] + 1
    oo
    #      [,1] [,2]
    # [1,]    1    6
    # [2,]   11   14

This edit handles the cases where 1) the vector starts with a 0 rather than a 1, and 2) the number of runs is odd or even, for example: xx <- c(1,1,1,1,1,1,0,0,0,0).
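To try those edge cases quickly, the code above can be wrapped in a function. This is a minimal sketch: the name run_bounds is made up for this example, and the input is assumed to contain only 0s and 1s.

```r
# Hypothetical wrapper around the code above; the name is illustrative
run_bounds <- function(xx) {
  len <- rle(xx)$lengths
  # With a leading 1, prepend 0 so the first run starts right after position 0
  rr <- if (xx[1] == 1) cumsum(c(0, len)) else cumsum(len)
  # An odd count means the vector ends with a run of 0s; drop that end point
  if (length(rr) %% 2 == 1) rr <- head(rr, -1)
  oo <- matrix(rr, ncol = 2, byrow = TRUE)
  # Run ends of the 0 blocks become starts of the 1 blocks after adding 1
  oo[, 1] <- oo[, 1] + 1
  oo
}

trailing_zeros <- run_bounds(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0))
leading_zeros  <- run_bounds(c(0, 0, 1, 1, 1, 0, 1))
```

Here trailing_zeros should be the single row 1 6, and leading_zeros should have the rows 3 5 and 7 7.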


Another short one:

    cbind(start = which(diff(c(0, xx)) == +1),
          end   = which(diff(c(xx, 0)) == -1))
    #      start end
    # [1,]     1   6
    # [2,]    11  14

I tested it on a very long vector and it is marginally slower than using rle, but more readable IMHO. If speed were really a concern, you could also do:

    xx.diff <- diff(c(0, xx, 0))
    cbind(start = which(head(xx.diff, -1) == +1),
          end   = which(tail(xx.diff, -1) == -1))
    #      start end
    # [1,]     1   6
    # [2,]    11  14

Here's another solution that builds on the others' ideas and is a little shorter and faster:

    matrix(which(diff(c(0,xx,0)) != 0) - c(0,1), ncol = 2, byrow = T)
    #      [,1] [,2]
    # [1,]    1    6
    # [2,]   11   14
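The one-liner above is compact, so here is a step-by-step breakdown of what it appears to do. The intermediate names d, chg, bounds and res are introduced only for this illustration, and the input is assumed to contain only 0s and 1s (any change between two nonzero values would otherwise be counted as a boundary).

```r
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)

# Pad with a 0 on both sides so runs touching either end still produce a change
d <- diff(c(0, xx, 0))

# Positions where the padded vector changes value; they alternate between
# a 0 -> 1 step (run start) and a 1 -> 0 step (one past the run end)
chg <- which(d != 0)

# c(0, 1) is recycled: starts are kept, ends are shifted back by one
bounds <- chg - c(0, 1)

# Fill row by row so each row holds one (start, end) pair
res <- matrix(bounds, ncol = 2, byrow = TRUE)
res
```

For this xx, chg is 1 7 11 15, so the rows of res are 1 6 and 11 14, matching the output above.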

I haven't tested the non-base (IRanges) solution, but here is a comparison of the base R ones:

    xx = sample(c(0,1), 1e5, T)
    microbenchmark(arun(xx), flodel(xx), flodel.fast(xx), eddi(xx))
    # Unit: milliseconds
    #             expr       min        lq    median        uq       max neval
    #         arun(xx) 14.021134 14.181134 14.246415 14.332655 15.220496   100
    #       flodel(xx) 12.885134 13.186254 13.248334 13.432974 14.367695   100
    #  flodel.fast(xx)  9.704010  9.952810 10.063691 10.211371 11.108171   100
    #         eddi(xx)  7.029448  7.276008  7.328968  7.439528  8.361609   100
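The benchmark calls arun, flodel, flodel.fast and eddi, but the thread does not show their definitions. Presumably they wrap the corresponding answers; the sketch below makes that assumption explicit and checks that the four variants agree before any timing is trusted. The helper same and the seed are introduced here only for the example.

```r
# Presumed wrappers around each answer; not defined in the thread itself
arun <- function(xx) {
  len <- rle(xx)$lengths
  rr <- if (xx[1] == 1) cumsum(c(0, len)) else cumsum(len)
  if (length(rr) %% 2 == 1) rr <- head(rr, -1)
  oo <- matrix(rr, ncol = 2, byrow = TRUE)
  oo[, 1] <- oo[, 1] + 1
  oo
}
flodel <- function(xx) {
  cbind(start = which(diff(c(0, xx)) == 1),
        end   = which(diff(c(xx, 0)) == -1))
}
flodel.fast <- function(xx) {
  xx.diff <- diff(c(0, xx, 0))
  cbind(start = which(head(xx.diff, -1) == 1),
        end   = which(tail(xx.diff, -1) == -1))
}
eddi <- function(xx) {
  matrix(which(diff(c(0, xx, 0)) != 0) - c(0, 1), ncol = 2, byrow = TRUE)
}

set.seed(1)  # arbitrary seed, only for reproducibility
xx <- sample(c(0, 1), 1e4, TRUE)

# Compare values only; the flodel variants carry column names
same <- function(a, b) isTRUE(all.equal(unname(a), unname(b)))

ref <- eddi(xx)
all_agree <- same(arun(xx), ref) &&
  same(flodel(xx), ref) &&
  same(flodel.fast(xx), ref)
all_agree
```

If the wrappers match the answers as assumed, all_agree is TRUE, so the timings compare functions that return the same start/end pairs.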