A subset of the first 500 rows by group, for a subset of groups

Question

This should have a simple answer. I want to subset my data for testing. I have a data frame in which I want to keep all columns of information; I simply want to reduce the number of observations PER individual. I have a unique identifier and about 50 individuals. I want to select only 2 individuals, and from each of those 2 I want only the first 500 data points.

My data frame is called wloc08 . There are 50 unique identifiers. I am keeping only 2 of these individuals, and for each of those 2 I would like only 500 data points.

 subwloc08=subset(wloc08, subset = ID %in% c("F07001","F07005"))

Can I use [ somewhere in this expression?

  reduced= subwloc08$ID[1:500,]

This does not work.

r data.table subset

Kerry 18 sept. '12 at 9:08

2 answers

If you are dealing with only two people, you can get away with subsetting each one separately and then combining the subsets with rbind:

 wloc08F07001 <- wloc08[which(wloc08$ID == "F07001")[1:500], ]
 wloc08F07005 <- wloc08[which(wloc08$ID == "F07005")[1:500], ]
 reduced <- rbind(wloc08F07001, wloc08F07005)
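One thing to watch with the which(...)[1:500] pattern: if a group has fewer rows than requested, the index is padded with NA and the result contains rows filled with NA. The sketch below illustrates this on a small made-up data frame (the toy wloc08, the val column, the helper first_n and n = 5 are all assumptions for illustration, not part of the original data) and shows head() as one possible way to avoid the padding.

```r
# Toy stand-in for wloc08 (hypothetical data): F07005 has only 3 rows.
wloc08 <- data.frame(
  ID  = c(rep("F07001", 10), rep("F07005", 3), rep("F07009", 10)),
  val = seq_len(23)
)
n <- 5  # stands in for 500

# which(...)[1:n] pads the index with NA when a group has fewer than n rows,
# and indexing a data frame by NA yields rows filled with NA.
padded <- wloc08[which(wloc08$ID == "F07005")[1:n], ]

# head() stops at the end of the group instead of padding.
# first_n is a hypothetical helper, not a base R function.
first_n <- function(df, id, n) head(df[df$ID == id, ], n)

reduced <- rbind(first_n(wloc08, "F07001", n),
                 first_n(wloc08, "F07005", n))
```

If every selected ID is known to have at least 500 rows, the two forms should give the same result.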

To make this more general, especially if you are dealing with large amounts of data, you can consider the data.table package. Here is an example:

 library(data.table)
 wloc08DT <- as.data.table(wloc08)  # Create data.table
 setkey(wloc08DT, "ID")             # Set a key to subset on
 # EDIT: A comment from Matthew Dowle pointed out that by = "ID" isn't necessary
 # reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500], by = "ID"]
 reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500]]

To break down the syntax of the last step:

1. c("F07001", "F07005") : this subsets your data by finding all the rows where the key is F07001 or F07005 . It also invokes "by without by" (see ?data.table for details).
2. .SD[1:500] : this subsets the .SD object (the Subset of the Data.table for the current group) by selecting rows 1:500.
3. EDIT: This piece has been removed thanks to a correction by Matthew Dowle; "by without by" is already invoked by step 1. Previously: ( by = "ID" : this tells [.data.table to perform the operation in step 2 for each ID separately, in this case only for the IDs specified in step 1.)
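As far as I am aware, data.table releases from 1.9.4 onward no longer group implicitly on a keyed join (that behaviour now has to be requested with by = .EACHI), so on a current installation the keyed form above may return the first 500 rows overall and not 500 per ID. A minimal sketch that states the grouping explicitly is given below; the toy table, the val column, ids and n = 5 are assumptions for illustration.

```r
library(data.table)

# Toy stand-in for wloc08 (hypothetical data).
wloc08DT <- data.table(
  ID  = rep(c("F07001", "F07005", "F07009"), times = c(10, 3, 10)),
  val = seq_len(23)
)
ids <- c("F07001", "F07005")
n <- 5  # stands in for 500

# Filter to the wanted IDs, then take the first n rows within each ID.
# head() returns fewer rows when a group is shorter than n.
reduced <- wloc08DT[ID %in% ids, head(.SD, n), by = ID]
```

This version does not rely on a key being set, which keeps the grouping visible in the call itself.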

Benbarnes 18 sept. '12 at 9:32

Sven hohenstein · Accepted Answer · 2012-09-18T09:21:11+0000

You can use lapply :

 do.call("rbind", lapply(c("F07001", "F07005"), function(x) wloc08[which(wloc08$ID == x)[1:500], ]))

Your command reduced = subwloc08$ID[1:500,] does not work because subwloc08$ID is a vector, which has a single dimension and so cannot take the two-index form [rows, columns]. reduced = subwloc08$ID[1:500] would run, but it would return only the first 500 values of subwloc08$ID (and not whole rows of subwloc08 ).

If you want to run this command for the first 30 IDs, you can use unique(wloc08$ID)[1:30] instead of c("F07001", "F07005") :
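The difference between indexing the column and indexing the data frame can be seen on a small made-up example (the toy subwloc08 and its val column are assumptions for illustration):

```r
# Toy stand-in for subwloc08 (hypothetical data).
subwloc08 <- data.frame(ID = rep("F07001", 6), val = 1:6)

ids_only <- subwloc08$ID[1:3]  # a vector takes one index: the first 3 IDs
rows     <- subwloc08[1:3, ]   # a data frame takes [rows, columns]: 3 whole rows

# The two-index form on a vector raises an error; it is caught here
# so that the script keeps running.
failed <- tryCatch({ subwloc08$ID[1:3, ]; FALSE }, error = function(e) TRUE)
```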

 do.call("rbind", lapply(unique(wloc08$ID)[1:30], function(x) wloc08[which(wloc08$ID == x)[1:500], ]))
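For many IDs, a different base R technique avoids building one subset per ID: number the rows within each ID with ave() and keep those numbered at most n. This is a swapped-in alternative, not the method from the answer above, and the toy data, val column, ids and n = 5 are assumptions for illustration.

```r
# Toy stand-in for wloc08 (hypothetical data).
wloc08 <- data.frame(
  ID  = rep(c("F07001", "F07005", "F07009"), times = c(10, 3, 10)),
  val = seq_len(23)
)
ids <- head(unique(wloc08$ID), 2)  # first 2 IDs; stands in for the first 30
n <- 5                             # stands in for 500

# Position of each row within its own ID group (1, 2, 3, ...).
pos <- ave(seq_len(nrow(wloc08)), wloc08$ID, FUN = seq_along)

# Keep rows that belong to a wanted ID and sit among its first n rows.
reduced <- wloc08[wloc08$ID %in% ids & pos <= n, ]
```

Using head(unique(...), 30) in place of unique(...)[1:30] also avoids NA entries when there are fewer than 30 distinct IDs.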
