Is there a better way to get the same result as table(vec), where vec is a vector?

Question

Suppose I have a vector, and I do not know, a priori, its unique elements (here: 1 and 2).

vec <- c(1, 1, 1, 2, 2, 2, 2)

I would like to know whether there is a better (or more elegant) way to get the count of each unique element in vec, i.e. the same result as table(vec). It does not matter whether the result is a data.frame or a named vector.

 R> table(vec)
 vec
 1 2
 3 4

Reason: I was curious to find out if there is a better way. In addition, I noticed that there is a for loop in the base implementation (in addition to calling .C). I don’t know if this is a big problem, but when I do something like

 R> table(rep(1:1000,100000))

R takes a long time. I am sure this is due to the sheer number of repetitions (100,000). But is there any way to do this faster?

EDIT: This also does a good job, in addition to Chase's answer (sampData is defined there):

 R> rle(sort(sampData))
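The rle() call returns a list of run lengths and values rather than a named vector. A minimal sketch of turning it into the same shape as table(vec), using the small vec from the question; the names runs and counts are only for illustration:

```r
vec <- c(1, 1, 1, 2, 2, 2, 2)

# After sorting, equal values are adjacent, so each run is one unique value
runs <- rle(sort(vec))

# Use the run values as names and the run lengths as counts
counts <- setNames(runs$lengths, runs$values)
counts  # counts 3 and 4 under the names "1" and "2"
```

Unlike tabulate(), this approach does not require positive integers, since it only relies on sorting.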

suncoolsu Dec 20 '10 at 2:40

1 answer

Chase · Accepted Answer · 2010-12-20T03:18:59+0000

This is an interesting problem - I am curious to see other thoughts about this. Looking at the source of table() shows that it is built on tabulate(). tabulate(), apparently, has a few quirks, namely that it deals only with positive integers and returns an integer vector without names. We can use unique() on our vector to supply the names(). If you need to count zero or negative values, I think going back and looking at table() is necessary, since tabulate() does not seem to handle them in the examples on the help page.
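The tabulate() behaviour described above is easy to see on small inputs. A short sketch in base R; the variable names gaps and nonpos are only for illustration:

```r
# One bin per integer from 1 to max(bin): the missing value 2 gets a zero count
gaps <- tabulate(c(1, 3))
gaps    # 1 0 1

# Non-positive values are silently ignored, and the result carries no names
nonpos <- tabulate(c(-2, 0, 2, 3, 3, 5))
nonpos  # 0 1 2 0 1
```

So the output is positional: element i is the count of the integer i, which is why names have to be attached separately.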

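 table2 <- function(data) {
   # tabulate() counts occurrences of each integer from 1 to max(data)
   x <- tabulate(data)
   # label the counts; this assumes every integer in 1..max(data) occurs in data
   y <- sort(unique(data))
   names(x) <- y
   return(x)
 }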
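table2() relies on the data being positive integers with no gaps, because tabulate() returns one bin per integer up to the maximum. A possible variant, which is not part of the original answer and has not been timed here, first maps each value to its position among the sorted unique values with match(). The name table3 is hypothetical:

```r
table3 <- function(data) {
  # Sorted unique values become the bin labels (sort() drops NA)
  keys <- sort(unique(data))
  # match() turns each value into a bin index 1..length(keys),
  # so zero, negative, or non-integer values are counted too
  counts <- tabulate(match(data, keys), nbins = length(keys))
  names(counts) <- keys
  counts
}

x <- c(-1, 0, 0, 5)
res <- table3(x)
res  # counts 1, 2, 1 under the names "-1", "0", "5"
```

The extra match() call probably costs some of the speed advantage over table(), so this is a trade of speed for generality rather than a drop-in replacement for table2().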

And a quick test:

 > set.seed(42)
 > sampData <- sample(1:5, 10000000, TRUE, prob = c(.3,.25, .2, .15, .1))
 >
 > system.time(table(sampData))
    user  system elapsed
   4.869   0.669   5.503
 > system.time(table2(sampData))
    user  system elapsed
   0.410   0.200   0.605
 >
 > table(sampData)
 sampData
       1       2       3       4       5
 2999200 2500232 1998652 1500396 1001520
 > table2(sampData)
       1       2       3       4       5
 2999200 2500232 1998652 1500396 1001520

EDIT: I just realized that plyr has a count() function, which is another alternative to table(). In the above test, it is faster than table(), and a little slower than the hack-job solution I put together:

 library(plyr)
 system.time(count(sampData))
    user  system elapsed
   1.620   0.870   2.483
