Select the first row by group

Question


From such a data frame

    test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
    test <- test[order(test$id), ]
    rownames(test) <- 1:10

    > test
       id string
    1   1      A
    2   1      F
    3   2      B
    4   2      G
    5   3      C
    6   3      H
    7   4      D
    8   4      I
    9   5      E
    10  5      J

I want to create a new data frame with the first row of each id / string pair. If sqldf accepted R code inside it, the query could look like this:

    res <- sqldf("select id, min(rownames(test)), string
                  from test
                  group by id, string")

    > res
      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

Is there any solution other than creating a new column like

 test$row <- rownames(test)

and running the same sqldf query with min(row)?

+60

r dataframe sqldf

dmvianna Nov 07 '12 at 22:45


8 answers

What about

    library(data.table)

    DT <- data.table(test)
    setkey(DT, id)

    # keyed join on each unique id, keeping only the first match per id
    DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the first row by key

 jdtu <- function() unique(DT)
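As far as I know, more recent data.table releases changed unique() so that it compares all columns by default rather than only the key columns. A minimal sketch, assuming such a version is installed, that names the grouping column explicitly through the by argument (the sample data is rebuilt from the question; first_by_id is just an illustrative name):

```r
library(data.table)

# Sample data from the question, ordered by id
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
DT <- data.table(test, key = "id")

# Keep the first row for each id; `by` states the columns explicitly,
# so the result does not depend on the default used by unique()
first_by_id <- unique(DT, by = "id")
print(first_by_id)
```

Passing by explicitly should give the same result whether or not a key is set.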

I think that if you order test outside the benchmark, then you can also remove the setkey and data.table conversion from the benchmark (since setkey basically sorts by id, just like order).

    set.seed(21)
    test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
    test <- test[order(test$id), ]
    # build the keyed data.table from test, outside the timed functions
    DT <- data.table(test, key = 'id')
    ju <- function() test[!duplicated(test$id),]

    jdt <- function() DT[J(unique(id)),mult = 'first']

    library(rbenchmark)
    benchmark(ju(), jdt(), replications = 5)
    ##    test replications elapsed relative user.self sys.self
    ## 2 jdt()            5    0.01        1      0.02        0
    ## 1  ju()            5    0.05        5      0.05        0

and with more data

** Edit with the unique method **

    set.seed(21)
    test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
    test <- test[order(test$id), ]
    DT <- data.table(test, key = 'id')

        test replications elapsed relative user.self sys.self
    2  jdt()            5    0.09     2.25      0.09     0.00
    3 jdtu()            5    0.04     1.00      0.05     0.00
    1   ju()            5    0.22     5.50      0.19     0.03

The unique method is the fastest here.

+12

mnel Nov 08


A simple ddply option:

 ddply(test,.(id),function(x) head(x,1))

If speed is a problem, a similar approach can be used with data.table :

    testd <- data.table(test)
    setkey(testd,id)
    testd[,.SD[1],by = key(testd)]

or this, which may be considerably faster:

    testd[testd[, .I[1], by = key(testd)]$V1]
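To make the .I variant easier to follow, here is a small sketch on data shaped like the question's. The inner call returns, for each id, the row number of the first row in that group (in the auto-named column V1), and the outer call subsets by those row numbers. The name idx is only for illustration.

```r
library(data.table)

# Data shaped like the question's: two rows per id, already ordered by id
testd <- data.table(
  id = rep(1:5, each = 2),
  string = c("A", "F", "B", "G", "C", "H", "D", "I", "E", "J")
)

# .I holds row numbers of the full table; .I[1] is the first one in each group
idx <- testd[, .I[1], by = id]$V1
print(idx)

# Subset the table by those row numbers
first_rows <- testd[idx]
print(first_rows)
```

Replacing .I[1] with .I[2] would, under the same assumptions, pick the second row of each group instead.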

+11

joran Nov 07 '12 at 10:50


(1) SQLite has a built-in rowid pseudo-column, so this works:

 sqldf("select min(rowid) rowid, id, string from test group by id")

giving:

      rowid id string
    1     1  1      A
    2     3  2      B
    3     5  3      C
    4     7  4      D
    5     9  5      E

(2) Also sqldf has the argument row.names= :

 sqldf("select min(cast(row_names as real)) row_names, id, string from test group by id", row.names = TRUE)

giving:

      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

(3) The third option, which mixes the elements of the two above, could be even better:

 sqldf("select min(rowid) row_names, id, string from test group by id", row.names = TRUE)

giving:

      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

Note that all three of these rely on a SQLite extension to SQL, where the use of min or max is guaranteed to result in the other columns being taken from the same row. (In other SQL databases this may not be guaranteed.)

+7

G. Grothendieck Nov 08


Now, for dplyr, adding a distinct counter.

 df %>% group_by(aa, bb) %>% summarise(first=head(value,1), count=n_distinct(value))

You create groups, then summarise within each group.

If the data is numeric, you can use first(value) [there is also last(value)] in place of head(value, 1).

see: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
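A minimal sketch of the first(value) variant on a small made-up data frame (the values below are illustrative, not the answer's full data). The .groups = "drop" argument assumes a reasonably recent dplyr; in my understanding first() also works on character columns, which is what this sketch uses.

```r
library(dplyr)

# Small illustrative data frame with the same column names as the answer
df <- data.frame(
  aa = c(1, 1, 1, 1, 2, 2),
  bb = c(1, 1, 2, 2, 1, 1),
  value = c("GUT", "PER", "SUT", "GUT", "221", "224"),
  stringsAsFactors = FALSE
)

# One row per (aa, bb) group: the first value and the number of distinct values
res <- df %>%
  group_by(aa, bb) %>%
  summarise(first = first(value), count = n_distinct(value), .groups = "drop")
print(res)
```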

Full:

    > df
    Source: local data frame [16 x 3]

       aa bb value
    1   1  1   GUT
    2   1  1   PER
    3   1  2   SUT
    4   1  2   GUT
    5   1  3   SUT
    6   1  3   GUT
    7   1  3   PER
    8   2  1   221
    9   2  1   224
    10  2  1   239
    11  2  2   217
    12  2  2   221
    13  2  2   224
    14  3  1   GUT
    15  3  1   HUL
    16  3  1   GUT

    > library(dplyr)
    > df %>%
    >   group_by(aa, bb) %>%
    >   summarise(first=head(value,1), count=n_distinct(value))

    Source: local data frame [6 x 4]
    Groups: aa

      aa bb first count
    1  1  1   GUT     2
    2  1  2   SUT     2
    3  1  3   SUT     3
    4  2  1   221     3
    5  2  2   217     3
    6  3  1   GUT     2

+5

Paul Paczuski Jul 29 '14 at 12:03


I support the dplyr approach.

    library(dplyr)
    test %>%
        group_by(id) %>%
        filter(row_number()==1)

    # A tibble: 5 x 2
    # Groups:   id [5]
         id string
      <int> <fct>
    1     1 A
    2     2 B
    3     3 C
    4     4 D
    5     5 E

Group by id and filter to keep only the first row of each group. In some cases, arranging the rows by id after the group_by may be necessary.
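An equivalent way to write this, as far as I can tell, is slice(1), which keeps the first row of each group by position. A minimal sketch on the question's sample data; first_rows is an illustrative name, and the final ungroup() is optional but avoids carrying the grouping into later steps.

```r
library(dplyr)

# Sample data from the question, ordered by id
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]

# Keep the first row of each id group, then drop the grouping
first_rows <- test %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup()
print(first_rows)
```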

+5

atomman Jun 20 '18 at 18:43


A base R option is the split() - lapply() - do.call() idiom:

    > do.call(rbind, lapply(split(test, test$id), head, 1))
      id string
    1  1      A
    2  2      B
    3  3      C
    4  4      D
    5  5      E

A more direct option is to lapply() the [ function:

    > do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
      id string
    1  1      A
    2  2      B
    3  3      C
    4  4      D
    5  5      E

The comma-space 1, ) at the end of the lapply() call is essential, as it is equivalent to calling [1, ] to select the first row and all columns.
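The equivalence can be checked directly. A minimal sketch on the question's sample data, calling the [ function by name with an empty last argument and comparing it with ordinary bracket indexing (via_function and via_brackets are illustrative names):

```r
# Sample data from the question, ordered by id
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]

# Calling `[` as a function: the empty argument after "1," means "all columns"
via_function <- `[`(test, 1, )

# Ordinary bracket indexing of the first row
via_brackets <- test[1, ]

print(identical(via_function, via_brackets))
```

lapply() passes the extra arguments after the function name on to `[` for every piece of the split, which is why the trailing empty argument has to be kept.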

+4

Gavin Simpson Nov 07


 test_subset <- test[unique(test$id),]

Only this line will generate the desired subset.
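A quick check on the question's sample data suggests that this line indexes rows by position, since the values of id are used as row numbers. On that data it appears to return the first five rows rather than the first row of each group. A minimal sketch of the comparison (by_position and by_group are illustrative names):

```r
# Sample data from the question, ordered by id with row names reset
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# unique(test$id) is 1:5, so this selects rows 1 to 5 by position
by_position <- test[unique(test$id), ]

# For comparison: the first row of each id group
by_group <- test[!duplicated(test$id), ]

print(by_position$id)
print(by_group$id)
```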

-1

girl Jan 12 '15 at 14:10


Joshua Ulrich · Accepted Answer · 2012-11-07 23:12

You can use duplicated to do this very quickly.

 test[!duplicated(test$id),]
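A short sketch of how this works: duplicated() marks every element that has already appeared earlier in the vector, so negating it keeps the first occurrence of each id. The same idea seems to extend to several grouping columns by passing a data frame of those columns. The grp column below is hypothetical, added only for illustration.

```r
# Sample data from the question, ordered by id
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]

# TRUE for the first occurrence of each id, FALSE for later repeats
keep <- !duplicated(test$id)
first_rows <- test[keep, ]
print(first_rows)

# Hypothetical second grouping column: only the last row differs
test$grp <- c(rep("x", 9), "y")

# First row of each (id, grp) pair
first_pairs <- test[!duplicated(test[c("id", "grp")]), ]
print(first_pairs)
```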

Benchmarks, for the speed freaks:

    ju <- function() test[!duplicated(test$id),]
    gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
    gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
    jply <- function() ddply(test,.(id),function(x) head(x,1))
    jdt <- function() {
      testd <- as.data.table(test)
      setkey(testd,id)
      # Initial solution (slow)
      # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
      # Faster options :
      testd[!duplicated(id)]               # (1)
      # testd[, .SD[1L], by=key(testd)]    # (2)
      # testd[J(unique(id)),mult="first"]  # (3)
      # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
    }

    library(plyr)
    library(data.table)
    library(rbenchmark)

    # sample data
    set.seed(21)
    test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
    test <- test[order(test$id), ]

    benchmark(ju(), gs1(), gs2(), jply(), jdt(),
        replications=5, order="relative")[,1:6]
    #     test replications elapsed relative user.self sys.self
    # 1   ju()            5    0.03    1.000      0.03     0.00
    # 5  jdt()            5    0.03    1.000      0.03     0.00
    # 3  gs2()            5    3.49  116.333      2.87     0.58
    # 2  gs1()            5    3.58  119.333      3.00     0.58
    # 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with only the contenders from the first heat, and with more data and more replications.

    set.seed(21)
    test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
    test <- test[order(test$id), ]
    benchmark(ju(), jdt(), order="relative")[,1:6]
    #    test replications elapsed relative user.self sys.self
    # 1  ju()          100    5.48    1.000      4.44     1.00
    # 2 jdt()          100    6.92    1.263      5.70     1.15
