How to remove columns from data.frame?

Not so much "How do you...?" but more "How do YOU...?"

If someone gives you a file with 200 columns and you want to reduce it to the few you need for analysis, how do you go about it? Does one solution offer an advantage over another?

Assume we have a data frame with columns col1, col2 through col200. If you only wanted columns 1-100, then 125-135 and 150-200, you could:

 dat$col101 <- NULL
 dat$col102 <- NULL # etc

or

 dat <- dat[,c("col1","col2",...)] 

or

 dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this 

or

 dat <- dat[,!names(dat) %in% c("col101","col102",...)] 

Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way, and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which you need in the code examples above, so if you print the names that way you have spaces everywhere and have to put in the commas manually... Is there a command that will give you "col1","col2","col3",... as its output so you can easily grab what you want?

+31
r dataframe
Aug 16 '11 at 0:00 a.m.
11 answers

I use data.table's := operator to delete columns instantly, regardless of the size of the table.

 DT[,coltodelete:=NULL] 

or

 DT[,c("col1","col20"):=NULL] 

or

 DT[,(125:135):=NULL] 

or

 DT[,(variableHoldingNamesOrNumbers):=NULL] 

Any solution using <- or subset will copy the whole table. data.table's := operator merely modifies the internal vector of pointers to the columns, in place. That operation is therefore (almost) instant.
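As a minimal sketch of the deletion by reference described above: the table DT and its column names are made up for illustration, and the data.table package is assumed to be installed (the block does nothing if it is not).

```r
# Hypothetical example table; requires the data.table package.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  DT <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

  # Column names held in a variable: wrap the variable in parentheses so
  # that := treats it as a vector of names rather than a new column name.
  drop_cols <- c("b", "d")
  DT[, (drop_cols) := NULL]

  print(names(DT))
}
```

Note that DT is not reassigned anywhere; the columns are removed from the existing object.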

+49
Aug 16 '11 at 11:01

To delete single columns, I'll just use dat$x <- NULL.

To delete multiple columns, but fewer than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL.

For more than that, I'll use subset, with negative names (!):

 subset(mtcars, , -c(mpg, cyl, disp, hp)) 
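A small runnable sketch of both idioms, using a copy of the built-in mtcars data (the names dat and dat2 are only for illustration):

```r
# Work on a copy so the built-in mtcars data set is left untouched.
dat <- mtcars

# Chained assignment removes a handful of columns in one statement.
dat$mpg <- dat$cyl <- NULL

# subset() with negated, unquoted names drops several columns at once.
dat2 <- subset(dat, select = -c(disp, hp))

names(dat2)
```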
+29
Aug 16 2018-11-11T00:

For clarity, I often use the select argument in subset. With newer folks, I've learned that keeping the number of commands they need to pick up to a minimum helps adoption. As their skills increase, so will their coding ability. And subset is one of the first commands I show people when they need to select data that meets a given criterion.

Something like:

 > subset(mtcars, select = c("mpg", "cyl", "vs", "am"))
                mpg cyl vs am
 Mazda RX4     21.0   6  0  1
 Mazda RX4 Wag 21.0   6  0  1
 Datsun 710    22.8   4  1  1
 ....

I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.

+9
Aug 16 2018-11-11T00:

Use read.table with "NULL" entries in colClasses to avoid creating the unwanted columns in the first place:

 ## example data and temp file
 x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10], stringsAsFactors = FALSE)
 tmp <- tempfile()
 write.table(x, tmp, row.names = FALSE)

 (y <- read.table(tmp, colClasses = c("numeric", rep("NULL", 2), "character"), header = TRUE))
     x a
 1   1 a
 2   2 b
 3   3 c
 4   4 d
 5   5 e
 6   6 f
 7   7 g
 8   8 h
 9   9 i
 10 10 j

 unlink(tmp)
+7
Aug 16 2018-11-11T00:

For the kinds of large files I tend to get, I generally wouldn't do this in R at all. I would use the cut command on Linux to process the data before it gets to R. This isn't a criticism of R, just a preference for some very basic Linux tools such as grep, tr, cut, sort, uniq, and occasionally sed and awk (or Perl) when something needs to be done with regular expressions.

Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask them to prefilter it so that I don't get extraneous data. Most of my colleagues are competent with Linux; fewer know R.
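If you would rather drive the same idea from inside R, one possible sketch is to read from a pipe() connection that runs cut. This assumes a Unix-like system with cut on the PATH and a tab-separated file; the temp file and its columns are made up for illustration.

```r
# Hypothetical tab-separated file with four columns.
x <- data.frame(x = 1:5, y = rnorm(5), z = runif(5), a = letters[1:5])
tmp <- tempfile(fileext = ".tsv")
write.table(x, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

if (.Platform$OS.type == "unix") {
  # cut keeps fields 1 and 4 (tab is its default delimiter),
  # so R only ever parses the columns that are needed.
  cmd <- paste("cut -f1,4", shQuote(tmp))
  y <- read.delim(pipe(cmd))
  print(names(y))
}

unlink(tmp)
```

The unwanted columns never reach R, which is the point of filtering upstream.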

(Updated) A method I would like to use before long is to pair mmap with a text file and examine the data in place, rather than read it into RAM at all. I have done this in C, and it can be extremely fast.

+5
Aug 16 2018-11-11T00:

Sometimes I like to do this using column indices instead.

 df <- data.frame(a=rnorm(100), b=rnorm(100), c=rnorm(100), d=rnorm(100), e=rnorm(100), f=rnorm(100), g=rnorm(100)) 

 as.data.frame(names(df))

   names(df)
 1         a
 2         b
 3         c
 4         d
 5         e
 6         f
 7         g

Removing columns "c" and "g"

 df[,-c(3,7)] 

This is especially useful if you have data.frames that are large or have long column names that you don't want to type, or column names that follow a pattern, because then you can use seq() to build the indices to delete.
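For instance, a minimal sketch of the seq() idea, assuming you want to drop every second column of a small made-up data frame:

```r
df <- data.frame(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7)

# Drop every second column (b, d, f) by generating the indices with seq().
drop_idx <- seq(2, ncol(df), by = 2)
df_odd <- df[, -drop_idx, drop = FALSE]

names(df_odd)
```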

RE: Your edit

You don't need to type "" around each string, or commas between them, to create a character vector. I find this little trick handy:

 x <- unlist(strsplit(
 'A
 B
 C
 D
 E', "\n"))
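A related option, not part of the trick above, is base R's scan() with its text argument, which splits on whitespace. A short sketch:

```r
# scan() splits on any whitespace, so names can be pasted in as-is.
x <- scan(text = "A B C D E", what = character(), quiet = TRUE)
x
```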
+3
Aug 16 2018-11-11T00:

Just addressing the edit.

@nzcoops, you don't need the column names in a comma-delimited string. You are thinking about this the wrong way round. When you do

 vec <- c("col1", "col2", "col3") 

you are creating a character vector. The , just separates the arguments taken by c() when you define that vector. names() and similar functions return a character vector of names.

 > dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
 > dat
   col1 col2 col3
 1    1    1    1
 2    2    2    2
 3    3    3    3
 > names(dat)
 [1] "col1" "col2" "col3"

It is far easier and less error-prone to select from the elements of names(dat) than to process its output into a comma-separated string that you can cut and paste from.

Say we want columns col1 and col3: subset names(dat), retaining only the ones we want:

 > names(dat)[c(1,3)]
 [1] "col1" "col3"
 > dat[, names(dat)[c(1,3)]]
   col1 col3
 1    1    1
 2    2    2
 3    3    3

You can sort of do what you want, but R will always print the vector to the screen in quotes ":

 > paste('"', names(dat), '"', sep = "", collapse = ", ")
 [1] "\"col1\", \"col2\", \"col3\""
 > paste("'", names(dat), "'", sep = "", collapse = ", ")
 [1] "'col1', 'col2', 'col3'"

so the latter may be more useful. However, now you have to cut and paste from that string. It is far better to work with objects that return what you want and to use standard subsetting routines to keep what you need.
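Applied to the 200-column case from the question, a sketch along these lines avoids typing or pasting any names. The all-zero data frame here is made up purely for illustration.

```r
# Hypothetical 200-column data frame named col1 ... col200 (3 rows each).
dat <- as.data.frame(matrix(0, nrow = 3, ncol = 200))
names(dat) <- paste0("col", 1:200)

# Build the names to keep from the numeric ranges, no typing or pasting.
keep <- paste0("col", c(1:100, 125:135, 150:200))
dat_small <- dat[, keep]

ncol(dat_small)
```

Selecting by constructed names rather than by position keeps working even if the column order in the file changes.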

+1
Aug 16 2018-11-11T00:

If you already have a vector of names, which can be created in several ways, you can easily use the subset function to keep or drop columns.

 dat2 <- subset(dat, select = names(dat) %in% c(KEEP)) 

In this case, KEEP is a vector of column names that has been previously created. For example:

 # sample data via Brandon Bertelsen
 df <- data.frame(a=rnorm(100), b=rnorm(100), c=rnorm(100), d=rnorm(100), e=rnorm(100), f=rnorm(100), g=rnorm(100))

 # creating the initial vector of names
 df1 <- as.matrix(as.character(names(df)))

 # retaining only the name values you want to keep
 KEEP <- as.vector(df1[c(1:3,5,6),])

 # subsetting the initial dataset with the object KEEP
 df3 <- subset(df, select = names(df) %in% c(KEEP))

Result:

 > head(df)
             a          b           c          d
 1  1.05526388  0.6316023 -0.04230455 -0.1486299
 2 -0.52584236  0.5596705  2.26831758  0.3871873
 3  1.88565261  0.9727644  0.99708383  1.8495017
 4 -0.58942525 -0.3874654  0.48173439  1.4137227
 5 -0.03898588 -1.5297600  0.85594964  0.7353428
 6  1.58860643 -1.6878690  0.79997390  1.1935813
             e           f           g
 1 -1.42751190  0.09842343 -0.01543444
 2 -0.62431091 -0.33265572 -0.15539472
 3  1.15130591  0.37556903 -1.46640276
 4 -1.28886526 -0.50547059 -2.20156926
 5 -0.03915009 -1.38281923  0.60811360
 6 -1.68024349 -1.18317733  0.42014397

 > head(df3)
             a          b           c           e
 1  1.05526388  0.6316023 -0.04230455 -1.42751190
 2 -0.52584236  0.5596705  2.26831758 -0.62431091
 3  1.88565261  0.9727644  0.99708383  1.15130591
 4 -0.58942525 -0.3874654  0.48173439 -1.28886526
 5 -0.03898588 -1.5297600  0.85594964 -0.03915009
 6  1.58860643 -1.6878690  0.79997390 -1.68024349
             f
 1  0.09842343
 2 -0.33265572
 3  0.37556903
 4 -0.50547059
 5 -1.38281923
 6 -1.18317733
+1
Jul 21 '16 at 2:05

From http://www.statmethods.net/management/subset.html

 # exclude variables v1, v2, v3
 myvars <- names(mydata) %in% c("v1", "v2", "v3")
 newdata <- mydata[!myvars]

 # exclude 3rd and 5th variable
 newdata <- mydata[c(-3,-5)]

 # delete variables v3 and v5
 mydata$v3 <- mydata$v5 <- NULL

I thought it was really clever to make a "not to include" list.

+1
Jan 06 '17 at 22:29

You can use the setdiff function:

If there are more columns to keep than to delete: suppose you want to delete 2 columns, say col1 and col2, from a data.frame DT; you can do the following:

 DT<-DT[,setdiff(names(DT),c("col1","col2"))] 

If there are more columns to delete than to keep: suppose you want to keep only col1 and col2:

 DT<-DT[,c("col1","col2")] 
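One caveat worth sketching: with a plain data.frame, if the selection leaves a single column, the [ , ] form simplifies the result to a vector unless drop = FALSE is given. The three-column DT below is made up for illustration.

```r
DT <- data.frame(col1 = 1:3, col2 = 4:6, col3 = 7:9)

# Dropping two of three columns leaves one; without drop = FALSE the
# result would be simplified to a plain vector instead of a data.frame.
keep <- setdiff(names(DT), c("col1", "col2"))
one_col <- DT[, keep, drop = FALSE]

is.data.frame(one_col)
```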
0
01 Oct '14 at 10:21

dplyr's select() function is powerful for subsetting columns. See ?select_helpers for a list of approaches.

In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:

 library(dplyr)

 df1 <- data.frame(first = 0, col1 = 1, col2 = 2, col3 = 3, col4 = 4)
 df1 %>%
   select(num_range("col", c(1, 4)))
 #>   col1 col4
 #> 1    1    4

In general, you can use the minus sign in select() to delete columns, for example:

 mtcars %>% select(-mpg, -wt) 
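If the names to drop are held in a character vector, a sketch like the following may help. It assumes dplyr 1.0.0 or later, where the all_of() helper is available, and it does nothing if dplyr is not installed; drop_cols and res are illustrative names.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Names to remove, stored in a variable rather than typed into select().
  drop_cols <- c("mpg", "wt")
  res <- mtcars %>% select(-all_of(drop_cols))

  print(names(res))
}
```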

Finally, to your question, "is there an easy way to create a workable vector of column names?" - yes, if you need to manually edit the list of names, use dput to get a comma-separated list, which you can easily manipulate:

 dput(names(mtcars))
 #> c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am",
 #> "gear", "carb")
0
Jan 29 '17 at 3:43


